CS4132 Data Analytics
- Analyzing the Impact of Government Expenditure on Education on Economic
and PsychologicalIndicators by Chandranshu Nanda
Background¶
Global Context¶
Education has long been recognized as a cornerstone for social and economic development. Globally, governments invest heavily in education, viewing it as a means to foster economic growth, reduce poverty, and promote social equity. The link between education and economic development was notably emphasized in the early 20th century with the rise of human capital theory, which posits that investment in education enhances individual productivity and, by extension, national economic performance.
Singapore's Context¶
Singapore provides a unique case study in the global landscape of education. Since gaining independence in 1965, Singapore has consistently prioritized education as a key driver of its economic transformation. The government’s strategic investments in education have played a pivotal role in transforming the nation from a developing country to a global economic powerhouse within a few decades.
The Singaporean government allocates a significant portion of its budget to education, aiming to provide high-quality education that equips its citizens with the skills necessary for the country’s knowledge-based economy. Policies such as the "Teach Less, Learn More" initiative, the emphasis on STEM education, and the creation of autonomous universities reflect Singapore’s commitment to continually evolving its educational system to meet the demands of a globalized world.
The outcomes of these investments are evident: Singapore consistently ranks at the top of international assessments like PISA (Programme for International Student Assessment), and its workforce is highly skilled and competitive.
Current State¶
Today, the global focus on education continues to evolve. Governments worldwide are increasingly recognizing the need to invest not only in traditional education but also in areas like mental health, digital literacy, and lifelong learning. The COVID-19 pandemic has further highlighted the importance of resilient and adaptable education systems.
In Singapore, education remains at the forefront of national policy. The government continues to adapt its education strategies to prepare its citizens for the challenges of the future, such as the digital economy and the need for innovation.
Motivation¶
This research on the impact of government expenditure on education, particularly in relation to economic and psychological indicators, is crucial and timely for several reasons:
Post-Pandemic Recovery and Resilience¶
The COVID-19 pandemic has profoundly disrupted educational systems worldwide, exacerbating existing inequalities and challenging the effectiveness of traditional education models. As countries, including Singapore, navigate post-pandemic recovery, understanding how education expenditure can contribute to both economic resilience and psychological well-being is vital. This research can inform policies that ensure education systems are robust and adaptable, helping societies recover more effectively from global disruptions.
Economic Uncertainty and Global Competition¶
In a rapidly changing global economy, where technological advancements and digitalization are reshaping industries, investing in education is more important than ever. Governments are under pressure to produce a workforce that is not only skilled but also adaptable to future economic shifts. This research can provide insights into how strategic investment in education can drive economic growth and maintain a competitive edge in the global market, particularly for countries like Singapore that rely on a knowledge-based economy.
Policy Development and Innovation¶
For policymakers, understanding the multifaceted impacts of education spending is essential for developing informed and effective policies. This research can provide evidence-based insights that guide the allocation of resources towards educational initiatives that yield the greatest economic and psychological benefits. In Singapore and beyond, such insights are crucial for ensuring that education systems evolve to meet contemporary challenges and opportunities.
Conclusion¶
In a world that is increasingly interconnected and facing unprecedented challenges, this research is necessary to explore how government expenditure on education can be optimized to support not only economic development but also the psychological well-being of populations. By addressing these critical issues now, the research can contribute to shaping education policies that are both forward-looking and responsive to current global trends.
What is the
quantitative impact ofcorrelation between government expenditure on educationonand GDPgrowth?- A country's GDP has a very strong correlation with its Expenditure on Education
- The Percentage of GDP a country decides to invest in education has a correlation with its GDP per capita. As a country gets bigger and more developed, this correlation gets stronger.
To what extent does government expenditure on education reduce income inequality, as measured by the Gini coefficient?
- A country's Educational Expenditure per Capita has a strong negative correlation on its Gini Index
- All countries with Gini Index greater than the 75th percentile have very low Educational Expenditure per Capita
What predictive trends can be identified regarding the future impact of current education spending in Singapore on the
psychologicaleconomical well-being of its population?- As Educational Expenditure increases, predicted GDP per Capita in 2073 also increases.
- For every increase of 1% in Educational Expenditure (% GDP), the forecasted GDP per Capita in 2073 increases by $1382
- As Educational Expenditure increases, predicted Gini Index in 2073 decreases.
- For every increase of 1% in Educational Expenditure (% GDP), the forecasted Gini Index in 2073 decreases by 2
- At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted GDP per Capita in 2073 is $88507 USD
- If Singapore were to spend the mean Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is $90299 USD
- At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted Gini Index in 2073 is 32.5
- If Singapore were to spend the mean Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is 29.0
- As Educational Expenditure increases, predicted GDP per Capita in 2073 also increases.
For each link, briefly describe the dataset that will be used in the project.
The data must be real — neither you nor someone else may make up the data.
https://www.iban.com/country-codes
- (Global), ISO Codes
https://tablebuilder.singstat.gov.sg/table/TS/M850011#
- (SG) Government Expenditure On Education, from 1960 - 2022
https://tablebuilder.singstat.gov.sg/table/TS/M850581#
- (SG) Residents Aged 25 Years & Over By Highest Qualification Attained, Sex And Age Group, from 1990 - 2023
https://tablebuilder.singstat.gov.sg/table/TS/M850671#
- (SG) Proportion Of Students Achieving At Least A Minimum Proficiency In Reading And Mathematics By Level And Sex, from 1995 - 2022
https://tablebuilder.singstat.gov.sg/table/TS/M850691#
- (SG) Total Net Enrolment Rate, Gender Parity Index, And Pupil-Teacher Ratio For Primary And Secondary Education, from 2016 - 2022
https://tablebuilder.singstat.gov.sg/table/TS/M183401
- (SG) Average Resident Unemployment Rate By Sex, Age And Highest Qualification Attained, from 1992 - 2023
https://tablebuilder.singstat.gov.sg/table/TS/M015651
- (SG) Gross Domestic Product At Current Prices, By Industry (SSIC 2020), from 1975 - 2024
https://tablebuilder.singstat.gov.sg/table/CT/17892
- (SG), Gini Coefficient Among Resident Employed Households, from 2000 - 2023
-
- (Global), Demographic and socio-economic: Socio-economic indicators, from 2017 - 2023
-
- (Global), Other policy relevant indicators : Government expenditure on education in US$, from 2017 - 2023
https://worldhappiness.report/data/
- (Global), World Happiness Report Appendices & Data, from 2005 - 2023
https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?_gl=1
- (Global), Government expenditure on education, total (% of GDP)
https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?_gl=1
- (Global), GDP growth (annual %)
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
- (Global), GDP (Current US$)
https://data.worldbank.org/indicator/SP.POP.TOTL?name_desc=false
- (Global), Total Population
https://data.worldbank.org/indicator/SI.POV.GINI?year=1999
- (Global), Gini Index
https://data.worldbank.org/indicator/SI.DST.10TH.10
- (Global), Income share held by highest 10%
https://data.worldbank.org/indicator/SI.DST.05TH.20
- (Global), Income share held by highest 20%
https://data.worldbank.org/indicator/SI.DST.FRST.20
- (Global), Income share held by lowest 20%
https://data.worldbank.org/indicator/SI.DST.FRST.10
- (Global), Income share held by lowest 10%
Below, we demonstrate the setup needed to run the following sections.
Below, we install all the required libraries.
! pip install pandas
! pip install matplotlib
! pip install numpy
! pip install requests
! pip install beautifulsoup4
! pip install ipython
! pip install plotly
! pip install seaborn
! pip install scipy
! pip install scikit-learn
! pip install nbconvert
Requirement already satisfied: pandas in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (2.2.3) Requirement already satisfied: numpy>=1.23.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2.1.1) Requirement already satisfied: python-dateutil>=2.8.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2024.2) Requirement already satisfied: tzdata>=2022.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2024.2) Requirement already satisfied: six>=1.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0) Requirement already satisfied: matplotlib in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (3.9.2) Requirement already satisfied: contourpy>=1.0.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (1.3.0) Requirement already satisfied: cycler>=0.10 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (4.54.1) Requirement already satisfied: kiwisolver>=1.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (1.4.7) Requirement already satisfied: numpy>=1.23 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (2.1.1) Requirement already satisfied: packaging>=20.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (24.1) Requirement already satisfied: pillow>=8 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (10.4.0) Requirement already satisfied: pyparsing>=2.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (3.1.4) Requirement already satisfied: python-dateutil>=2.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (2.9.0.post0) Requirement already satisfied: six>=1.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0) Requirement already satisfied: numpy in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (2.1.1) Requirement already satisfied: requests in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (2.32.3) Requirement already satisfied: charset-normalizer<4,>=2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (2.2.3) Requirement already satisfied: certifi>=2017.4.17 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (2024.8.30) Requirement already satisfied: beautifulsoup4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (4.12.3) Requirement already satisfied: soupsieve>1.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from beautifulsoup4) (2.6) Requirement already satisfied: ipython in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (8.27.0) Requirement already satisfied: decorator in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (5.1.1) Requirement already satisfied: jedi>=0.16 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (0.19.1) Requirement already satisfied: matplotlib-inline in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (0.1.7) Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (3.0.48) Requirement already satisfied: pygments>=2.4.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (2.18.0) Requirement already satisfied: stack-data in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (0.6.3) Requirement already satisfied: traitlets>=5.13.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (5.14.3) Requirement already satisfied: typing-extensions>=4.6 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (4.12.2) Requirement already satisfied: pexpect>4.3 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (4.9.0) Requirement already satisfied: parso<0.9.0,>=0.8.3 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jedi>=0.16->ipython) (0.8.4) Requirement already satisfied: ptyprocess>=0.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pexpect>4.3->ipython) (0.7.0) Requirement already satisfied: wcwidth in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython) (0.2.13) Requirement already satisfied: executing>=1.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from stack-data->ipython) (2.1.0) Requirement already satisfied: asttokens>=2.1.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from stack-data->ipython) (2.4.1) Requirement already satisfied: pure-eval in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from stack-data->ipython) (0.2.3) Requirement already satisfied: six>=1.12.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from asttokens>=2.1.0->stack-data->ipython) (1.16.0) Requirement already satisfied: plotly in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (5.24.1) Requirement already satisfied: tenacity>=6.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from plotly) (9.0.0) Requirement already satisfied: packaging in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from plotly) (24.1) Requirement already satisfied: seaborn in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (0.13.2) Requirement already satisfied: numpy!=1.24.0,>=1.20 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from seaborn) (2.1.1) Requirement already satisfied: pandas>=1.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from seaborn) (2.2.3) Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from seaborn) (3.9.2) Requirement already satisfied: contourpy>=1.0.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.0) Requirement already satisfied: cycler>=0.10 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.54.1) Requirement already satisfied: kiwisolver>=1.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.7) Requirement already satisfied: packaging>=20.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1) Requirement already satisfied: pillow>=8 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.4.0) Requirement already satisfied: pyparsing>=2.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.4) Requirement already satisfied: python-dateutil>=2.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.2) Requirement already satisfied: tzdata>=2022.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.2) Requirement already satisfied: six>=1.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0) Requirement already satisfied: scipy in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (1.14.1) Requirement already satisfied: numpy<2.3,>=1.23.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scipy) (2.1.1) Requirement already satisfied: scikit-learn in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (1.5.2) Requirement already satisfied: numpy>=1.19.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (2.1.1) Requirement already satisfied: scipy>=1.6.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (1.14.1) Requirement already satisfied: joblib>=1.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (1.4.2) Requirement already satisfied: threadpoolctl>=3.1.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (3.5.0) Requirement already satisfied: nbconvert in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (7.16.4) Requirement already satisfied: beautifulsoup4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (4.12.3) Requirement already satisfied: bleach!=5.0.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (6.1.0) Requirement already satisfied: defusedxml in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (0.7.1) Requirement already satisfied: jinja2>=3.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (3.1.4) Requirement already satisfied: jupyter-core>=4.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (5.7.2) Requirement already satisfied: jupyterlab-pygments in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (0.3.0) Requirement already satisfied: markupsafe>=2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (2.1.5) Requirement already satisfied: mistune<4,>=2.0.3 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (3.0.2) Requirement already satisfied: nbclient>=0.5.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (0.10.0) Requirement already satisfied: nbformat>=5.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (5.10.4) Requirement already satisfied: packaging in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (24.1) Requirement already satisfied: pandocfilters>=1.4.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (1.5.1) Requirement already satisfied: pygments>=2.4.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (2.18.0) Requirement already satisfied: tinycss2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (1.3.0) Requirement already satisfied: traitlets>=5.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (5.14.3) Requirement already satisfied: six>=1.9.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from bleach!=5.0.0->nbconvert) (1.16.0) Requirement already satisfied: webencodings in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from bleach!=5.0.0->nbconvert) (0.5.1) Requirement already satisfied: platformdirs>=2.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-core>=4.7->nbconvert) (4.3.6) Requirement already satisfied: jupyter-client>=6.1.12 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbclient>=0.5.0->nbconvert) (7.4.9) Requirement already satisfied: fastjsonschema>=2.15 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbformat>=5.7->nbconvert) (2.20.0) Requirement already satisfied: jsonschema>=2.6 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbformat>=5.7->nbconvert) (4.23.0) Requirement already satisfied: soupsieve>1.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from beautifulsoup4->nbconvert) (2.6) Requirement already satisfied: attrs>=22.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (24.2.0) Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (2023.12.1) Requirement already satisfied: referencing>=0.28.4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (0.35.1) Requirement already satisfied: rpds-py>=0.7.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (0.20.0) Requirement already satisfied: entrypoints in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (0.4) Requirement already satisfied: nest-asyncio>=1.5.4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (1.6.0) Requirement already satisfied: python-dateutil>=2.8.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (2.9.0.post0) Requirement already satisfied: pyzmq>=23.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (24.0.1) Requirement already satisfied: tornado>=6.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (6.4.1)
Below, we import all the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests
from bs4 import BeautifulSoup
from IPython.display import display
import plotly.express as px
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
Below, we do the setup for Plotly. Since Plotly uses Javascript, we must change the renderer when we export to HTML. Additionally, the notebook must be connected to wifi for the Plotly to work as intended.
import plotly.io as pio
pio.renderers.default = 'browser' # Change to 'browser' before exporting to html.
The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report.
Import and display each dataset in a dataframe.
For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.
Cleaning MXXXXXX.csv¶
The following function loads and cleans CSV files from https://tablebuilder.singstat.gov.sg/
def load_and_clean_MXXXXXX(filename, skiprows, drop_last_rows):
# Load the CSV file, skipping the specified number of rows at the beginning
df = pd.read_csv(f'./data/{filename}.csv', skiprows=skiprows, index_col=0)
# Transpose the DataFrame and drop the specified number of rows from the end
df = df.T.iloc[:, :-drop_last_rows]
return df
This dataset covers ISO Codes. We will obtain the data via web scraping.
url = "https://www.iban.com/country-codes"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find("table", {"class": "table"})
headers = [header.text for header in table.find_all("th")]
rows = []
for row in table.find_all("tr")[1:]:
cells = row.find_all("td")
row_data = [cell.text for cell in cells]
rows.append(row_data)
country_codes_df = pd.DataFrame(rows, columns=headers)
country_codes_dict = country_codes_df.set_index('Alpha-3 code')[['Country']].to_dict()['Country']
country_codes_df
| Country | Alpha-2 code | Alpha-3 code | Numeric | |
|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 004 |
| 1 | Åland Islands | AX | ALA | 248 |
| 2 | Albania | AL | ALB | 008 |
| 3 | Algeria | DZ | DZA | 012 |
| 4 | American Samoa | AS | ASM | 016 |
| ... | ... | ... | ... | ... |
| 244 | Wallis and Futuna | WF | WLF | 876 |
| 245 | Western Sahara | EH | ESH | 732 |
| 246 | Yemen | YE | YEM | 887 |
| 247 | Zambia | ZM | ZMB | 894 |
| 248 | Zimbabwe | ZW | ZWE | 716 |
249 rows × 4 columns
M850011.csv contains all of the information related to government expenditure on education in Singapore over multiple years. This file includes a detailed breakdown of the government's education spending, allowing for analysis of trends and impacts across different education levels and time periods.
df1 = load_and_clean_MXXXXXX('M850011', 9, 31)
df1
| Data Series | Total Government Expenditure On Education (Thousand Dollars) | Operating Expenditure (Thousand Dollars) | Primary Schools (Thousand Dollars) | Secondary Schools & Junior Colleges (Thousand Dollars) | Institute Of Technical Education (Thousand Dollars) | Tertiary (Thousand Dollars) | Universities (Thousand Dollars) | National Institute Of Education (Thousand Dollars) | Polytechnics (Thousand Dollars) | Others (Thousand Dollars) | Development Expenditure (Thousand Dollars) | Proportion Of Government Expenditure On Education To GDP (Per Cent) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | 13060530 | 12890289 | 3143296 | 3200498 | 491147 | 4304619 | 3047510 | 125976 | 1131133 | 1750729 | 170241 | na |
| 2021 | 12909908.0 | 12604777.0 | 2944535.0 | 3084162.0 | 498915.0 | 4225708.0 | 2939737.0 | 123306.0 | 1162665.0 | 1851457.0 | 305131.0 | 2.2 |
| 2020 | 12259853.0 | 11766853.0 | 2674257.0 | 2880658.0 | 470521.0 | 4155545.0 | 2906300.0 | 122227.0 | 1127018.0 | 1585872.0 | 493000.0 | 2.5 |
| 2019 | 12723276.0 | 11932177.0 | 2738444.0 | 3054241.0 | 473599.0 | 4307399.0 | 3008764.0 | 124176.0 | 1174459.0 | 1358494.0 | 791099.0 | 2.5 |
| 2018 | 12875992.0 | 12429006.0 | 2823567.0 | 3176690.0 | 489278.0 | 4608243.0 | 3243605.0 | 105071.0 | 1259567.0 | 1331228.0 | 446986.0 | 2.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1964 | 120229 | 103358 | 61337 | 21212 | 3376 | 12735 | 9255 | 1630 | 1850 | 4698 | 16872 | na |
| 1963 | 107888 | 94644 | 58581 | 17206 | 2362 | 12277 | 8744 | 1533 | 2000 | 4218 | 13244 | na |
| 1962 | 94791 | 82307 | 50835 | 14751 | 1217 | 11248 | 8335 | 1318 | 1595 | 4256 | 12484 | na |
| 1961 | 74557 | 67650 | 43908 | 12466 | 1233 | 6684 | 3689 | 1200 | 1795 | 3359 | 6907 | na |
| 1960 | 61403 | 57100 | 37259 | 10409 | 765 | 5816 | 3144 | 1440 | 1233 | 2850 | 4303 | na |
63 rows × 12 columns
M850581.csv contains all of the information related to population distribution by age and gender in Singapore, particularly focusing on the breakdown of educational attainment levels across different age groups. This file provides a comprehensive overview of how different educational attainment levels are distributed across various age groups and genders in Singapore.
df2 = load_and_clean_MXXXXXX('M850581', 10, 21)
df2
| Data Series | Total | 25 - 29 Years | 30 - 34 Years | 35 - 39 Years | 40 - 44 Years | 45 - 49 Years | 50 - 54 Years | 55 - 59 Years | 60 - 64 Years | 65 Years & Over | ... | Total Females - University | 25 - 29 Years | 30 - 34 Years | 35 - 39 Years | 40 - 44 Years | 45 - 49 Years | 50 - 54 Years | 55 - 59 Years | 60 - 64 Years | 65 Years & Over |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023 | 3100900.0 | 238000.0 | 307600.0 | 298300.0 | 313600.0 | 310700.0 | 317800.0 | 312500.0 | 292400.0 | 710100.0 | ... | 573500.0 | 76800.0 | 103600.0 | 94600.0 | 93000.0 | 73800.0 | 56400.0 | 35500.0 | 19600.0 | 20300.0 |
| 2022 | 3085300.0 | 215400.0 | 305500.0 | 308600.0 | 333100.0 | 343400.0 | 304100.0 | 310700.0 | 295300.0 | 669200.0 | ... | 552400.0 | 68500.0 | 99100.0 | 92400.0 | 94600.0 | 76900.0 | 50800.0 | 32200.0 | 16500.0 | 21400.0 |
| 2021 | 2978300.0 | 254800.0 | 325600.0 | 310400.0 | 309100.0 | 302000.0 | 260600.0 | 286200.0 | 302500.0 | 627000.0 | ... | 534200.0 | 83500.0 | 108400.0 | 93600.0 | 85100.0 | 63300.0 | 40500.0 | 24800.0 | 17000.0 | 17900.0 |
| 2020 | 2977800.0 | 268900.0 | 296000.0 | 297800.0 | 298100.0 | 311500.0 | 294200.0 | 306300.0 | 286000.0 | 619000.0 | ... | 496600.0 | 86900.0 | 94600.0 | 87600.0 | 78600.0 | 60700.0 | 37600.0 | 23600.0 | 13700.0 | 13300.0 |
| 2019 | 2922700.0 | 252300.0 | 273200.0 | 296900.0 | 304500.0 | 323000.0 | 311900.0 | 316100.0 | 266200.0 | 578600.0 | ... | 470500.0 | 77400.0 | 83800.0 | 88300.0 | 78000.0 | 61200.0 | 34400.0 | 23000.0 | 10700.0 | 13600.0 |
| 2018 | 2877000.0 | 251800.0 | 263800.0 | 303400.0 | 313100.0 | 313200.0 | 316600.0 | 299000.0 | 274800.0 | 541400.0 | ... | 440400.0 | 76000.0 | 79000.0 | 86300.0 | 73500.0 | 53200.0 | 31300.0 | 18100.0 | 12000.0 | 10900.0 |
| 2017 | 2846000.0 | 242300.0 | 262300.0 | 305700.0 | 320000.0 | 313700.0 | 307800.0 | 285000.0 | 261700.0 | 547500.0 | ... | 420500.0 | 72200.0 | 81500.0 | 84600.0 | 68900.0 | 51400.0 | 28700.0 | 15600.0 | 8400.0 | 9200.0 |
| 2016 | 2795200.0 | 253000.0 | 268200.0 | 288900.0 | 304000.0 | 301100.0 | 321700.0 | 299500.0 | 261400.0 | 497300.0 | ... | 398700.0 | 76900.0 | 80200.0 | 76600.0 | 62100.0 | 43800.0 | 27800.0 | 14900.0 | 7600.0 | 8900.0 |
| 2015 | 2750900.0 | 232800.0 | 269900.0 | 300200.0 | 315300.0 | 301500.0 | 319400.0 | 296700.0 | 248200.0 | 467000.0 | ... | 374800.0 | 67200.0 | 76800.0 | 77700.0 | 60800.0 | 39100.0 | 23600.0 | 13900.0 | 7900.0 | 8000.0 |
| 2014 | 2732600.0 | 231700.0 | 276900.0 | 287900.0 | 305100.0 | 298600.0 | 318300.0 | 296400.0 | 246200.0 | 471500.0 | ... | 362400.0 | 67100.0 | 78200.0 | 70500.0 | 56900.0 | 38300.0 | 23900.0 | 12900.0 | 7300.0 | 7200.0 |
| 2013 | 2669100.0 | 223700.0 | 274600.0 | 297500.0 | 310300.0 | 315100.0 | 318900.0 | 280200.0 | 229400.0 | 419500.0 | ... | 350000.0 | 63900.0 | 79100.0 | 72200.0 | 54400.0 | 33800.0 | 22800.0 | 12700.0 | 6200.0 | 5000.0 |
| 2012 | 2626400.0 | 223200.0 | 271700.0 | 302300.0 | 310800.0 | 319200.0 | 313200.0 | 272100.0 | 219700.0 | 394100.0 | ... | 320700.0 | 60200.0 | 75400.0 | 68700.0 | 50500.0 | 28600.0 | 17100.0 | 11000.0 | 4800.0 | 4500.0 |
| 2011 | 2601900.0 | 230900.0 | 280100.0 | 306900.0 | 302300.0 | 321700.0 | 309600.0 | 267400.0 | 209400.0 | 373700.0 | ... | 307300.0 | 61200.0 | 74900.0 | 65200.0 | 45100.0 | 27700.0 | 15000.0 | 8700.0 | 4800.0 | 4700.0 |
| 2010 | 2576000.0 | 249400.0 | 289300.0 | 317300.0 | 303500.0 | 319600.0 | 303600.0 | 252800.0 | 196400.0 | 344100.0 | ... | 288100.0 | 64700.0 | 72100.0 | 60400.0 | 38000.0 | 24400.0 | 13300.0 | 7500.0 | 4200.0 | 3500.0 |
| 2009 | 2523000.0 | 237800.0 | 280200.0 | 312900.0 | 311900.0 | 319000.0 | 298400.0 | 233300.0 | 182100.0 | 347400.0 | ... | 266400.0 | 60400.0 | 69500.0 | 57100.0 | 33300.0 | 21200.0 | 12000.0 | 6400.0 | 3300.0 | 3300.0 |
| 2008 | 2441300.0 | 228000.0 | 272900.0 | 307900.0 | 314400.0 | 316000.0 | 290000.0 | 228100.0 | 160300.0 | 323700.0 | ... | 246300.0 | 55500.0 | 64700.0 | 51700.0 | 32200.0 | 21000.0 | 10200.0 | 5800.0 | 2400.0 | 2800.0 |
| 2007 | 2448400.0 | 231800.0 | 288000.0 | 304300.0 | 317600.0 | 324300.0 | 284900.0 | 229700.0 | 144000.0 | 323600.0 | ... | 226300.0 | 52500.0 | 62500.0 | 45400.0 | 28100.0 | 16600.0 | 10800.0 | 5100.0 | 2700.0 | 2600.0 |
| 2006 | 2408600.0 | 231200.0 | 282000.0 | 298600.0 | 321800.0 | 322600.0 | 276000.0 | 224000.0 | 130800.0 | 321500.0 | ... | 212700.0 | 52800.0 | 56500.0 | 41200.0 | 27300.0 | 16400.0 | 8800.0 | 4800.0 | 2300.0 | 2500.0 |
| 2005 | 2308500.0 | 235700.0 | 288000.0 | 302800.0 | 319400.0 | 300300.0 | 256800.0 | 201200.0 | 120000.0 | 284300.0 | ... | 180500.0 | 48800.0 | 49400.0 | 34400.0 | 22400.0 | 11900.0 | 7000.0 | 3400.0 | 1600.0 | 1700.0 |
| 2004 | 2282400.0 | 230000.0 | 281700.0 | 292400.0 | 322400.0 | 303000.0 | 250500.0 | 190300.0 | 127900.0 | 284100.0 | ... | 172800.0 | 47000.0 | 44300.0 | 32200.0 | 22700.0 | 13300.0 | 6000.0 | 3800.0 | 1600.0 | 1900.0 |
| 2003 | 2263000.0 | 232400.0 | 287200.0 | 308500.0 | 314800.0 | 293800.0 | 243300.0 | 170500.0 | 132400.0 | 280100.0 | ... | 158600.0 | 46000.0 | 42000.0 | 28800.0 | 18000.0 | 11900.0 | 6200.0 | 2900.0 | 1400.0 | 1400.0 |
| 2002 | 2198900.0 | 233400.0 | 281000.0 | 320100.0 | 313900.0 | 281500.0 | 235100.0 | 150100.0 | 127700.0 | 256100.0 | ... | 140000.0 | 42500.0 | 39100.0 | 27200.0 | 13400.0 | 9000.0 | 4700.0 | 2000.0 | 1200.0 | 800.0 |
| 2001 | 2126900.0 | 242800.0 | 278000.0 | 318800.0 | 311700.0 | 267900.0 | 223400.0 | 121000.0 | 121600.0 | 241900.0 | ... | 121200.0 | 37000.0 | 32200.0 | 23000.0 | 13300.0 | 7800.0 | 3900.0 | 1900.0 | 1300.0 | 800.0 |
| 2000 | 2074000.0 | 250500.0 | 279500.0 | 310700.0 | 304500.0 | 256800.0 | 204900.0 | 125000.0 | 110700.0 | 231300.0 | ... | 105100.0 | 35000.0 | 27100.0 | 18800.0 | 11200.0 | 6400.0 | 3400.0 | 1600.0 | 900.0 | 800.0 |
| 1999 | 2067300.0 | 251500.0 | 293400.0 | 309900.0 | 300300.0 | 251800.0 | 186100.0 | 131900.0 | 113400.0 | 229000.0 | ... | 94100.0 | 31700.0 | 24800.0 | 15900.0 | 10200.0 | 5700.0 | 2900.0 | 1500.0 | 800.0 | 700.0 |
| 1998 | 2020800.0 | 242200.0 | 290000.0 | 307700.0 | 299600.0 | 252200.0 | 171400.0 | 132100.0 | 102000.0 | 223700.0 | ... | 87500.0 | 28400.0 | 21800.0 | 15400.0 | 9900.0 | 6000.0 | 2900.0 | 1200.0 | 900.0 | 900.0 |
| 1997 | 1969700.0 | 248300.0 | 293000.0 | 303400.0 | 286200.0 | 244500.0 | 154900.0 | 133000.0 | 99100.0 | 207300.0 | ... | 75500.0 | 24700.0 | 19800.0 | 12700.0 | 8600.0 | 5300.0 | 2200.0 | 1400.0 | 400.0 | 500.0 |
| 1996 | 1916400.0 | 240700.0 | 298200.0 | 305800.0 | 278700.0 | 233900.0 | 130900.0 | 127800.0 | 95600.0 | 204600.0 | ... | 69100.0 | 23500.0 | 18800.0 | 11000.0 | 7200.0 | 4200.0 | 2000.0 | 1300.0 | 400.0 | 800.0 |
| 1995 | 1860900.0 | 239900.0 | 292400.0 | 301200.0 | 264400.0 | 211300.0 | 132800.0 | 121700.0 | 97200.0 | 200000.0 | ... | 54300.0 | 18400.0 | 14700.0 | 9400.0 | 5400.0 | 3200.0 | 1700.0 | 800.0 | 300.0 | 500.0 |
| 1994 | 1816800.0 | 256100.0 | 291000.0 | 291500.0 | 248300.0 | 190400.0 | 142200.0 | 118900.0 | 96000.0 | 182500.0 | ... | 49600.0 | 17300.0 | 12800.0 | 7900.0 | 5300.0 | 2700.0 | 1600.0 | 1000.0 | 500.0 | 500.0 |
| 1993 | 1768400.0 | 255300.0 | 283400.0 | 278300.0 | 248900.0 | 175700.0 | 141900.0 | 113700.0 | 95100.0 | 176100.0 | ... | 44500.0 | 15800.0 | 11500.0 | 6600.0 | 4900.0 | 2900.0 | 1600.0 | 600.0 | 300.0 | 300.0 |
| 1992 | 1730600.0 | 271400.0 | 292200.0 | 266700.0 | 232100.0 | 155100.0 | 145000.0 | 108300.0 | 93200.0 | 166600.0 | ... | 37100.0 | 13400.0 | 10100.0 | 6200.0 | 3200.0 | 1900.0 | 1000.0 | 800.0 | 100.0 | 300.0 |
| 1991 | 1679300.0 | 272100.0 | 288300.0 | 259900.0 | 226500.0 | 137700.0 | 136300.0 | 107200.0 | 89600.0 | 161700.0 | ... | 31400.0 | 12600.0 | 7400.0 | 4700.0 | 3400.0 | 1600.0 | 900.0 | 500.0 | 300.0 | 100.0 |
| 1990 | 1622300.0 | 281500.0 | 292700.0 | 252600.0 | 203900.0 | 127500.0 | 117500.0 | 99800.0 | 82800.0 | 164100.0 | ... | 28900.0 | 10900.0 | 7400.0 | 4700.0 | 3000.0 | 1400.0 | 800.0 | 300.0 | 200.0 | 200.0 |
34 rows × 180 columns
M850671.csv contains all of the information related to the proficiency levels of students in primary and secondary education in Singapore, specifically focusing on their achievements in reading and mathematics. This file provides a detailed view of the educational outcomes in terms of reading and mathematics proficiency among students in Singapore's primary and secondary education systems.
df3=load_and_clean_MXXXXXX('M850671', 10, 24)
df3
| Data Series | Proportion Of Students In Primary Education Achieving At Least A Minimum Proficiency Level In Reading | Male | Female | Proportion Of Students In Secondary Education Achieving At Least A Minimum Proficiency Level In Reading | Male | Female | Proportion Of Students In Primary Education Achieving At Least A Minimum Proficiency Level In Mathematics | Male | Female | Proportion Of Students In Secondary Education Achieving At Least A Minimum Proficiency Level In Mathematics | Male | Female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | na | na | na | 88.83 | 86.32 | 91.44 | na | na | na | 91.98 | 91.53 | 92.44 |
| 2021 | 96.66 | 95.46 | 97.93 | na | na | na | na | na | na | na | na | na |
| 2019 | na | na | na | na | na | na | 95.51 | 95.07 | 95.98 | 91.82 | 90.47 | 93.23 |
| 2018 | na | na | na | 88.75 | 85.77 | 91.86 | na | na | na | na | na | na |
| 2016 | 97.25 | 96.4 | 98.16 | na | na | na | na | na | na | na | na | na |
| 2015 | na | na | na | 88.85 | 86.34 | 91.53 | 93.39 | 92.73 | 94.1 | 93.55 | 92.27 | 94.96 |
| 2012 | na | na | na | 90.13 | 86.58 | 93.84 | na | na | na | na | na | na |
| 2011 | 96.79 | 95.75 | 97.9 | na | na | na | 93.87 | 92.94 | 94.85 | 92.48 | 90.41 | 94.61 |
| 2009 | na | na | na | 87.52 | 83.83 | 91.34 | na | na | na | na | na | na |
| 2007 | na | na | na | na | na | na | 91.75 | 90.38 | 93.22 | 88.17 | 85.79 | 90.79 |
| 2006 | 96.65 | 95.54 | 97.98 | na | na | na | na | na | na | na | na | na |
| 2003 | na | na | na | na | na | na | 91.2 | 89.81 | 92.65 | 92.85 | 91.53 | 94.29 |
| 2001 | 90.21 | 87.55 | 93.07 | na | na | na | na | na | na | na | na | na |
| 1999 | na | na | na | na | na | na | na | na | na | 93.87 | 93.11 | 94.68 |
| 1995 | na | na | na | na | na | na | 84.98 | 83.41 | 86.74 | 95.34 | 95.37 | 95.39 |
M850691.csv contains all of the information related to net enrolment rates in primary and secondary education in Singapore, with a focus on gender parity and overall enrolment metrics. This file provides insights into the enrolment rates and gender parity in education within Singapore, allowing for analysis of trends in educational access across different demographics.
df4=load_and_clean_MXXXXXX('M850691', 9, 29)
df4
| Data Series | Total Net Enrolment Rate - Primary Education (Total) (Per Cent) | Total Net Enrolment Rate - Primary Education (Male) (Per Cent) | Total Net Enrolment Rate - Primary Education (Female) (Per Cent) | Total Net Enrolment Rate - Secondary Education (Total) (Per Cent) | Total Net Enrolment Rate - Secondary Education (Male) (Per Cent) | Total Net Enrolment Rate - Secondary Education (Female) (Per Cent) | Total Net Enrolment Rate - Lower Secondary Education (Total) (Per Cent) | Total Net Enrolment Rate - Lower Secondary Education (Male) (Per Cent) | Total Net Enrolment Rate - Lower Secondary Education (Female) (Per Cent) | Total Net Enrolment Rate - Upper Secondary Education (Total) (Per Cent) | Total Net Enrolment Rate - Upper Secondary Education (Male) (Per Cent) | Total Net Enrolment Rate - Upper Secondary Education (Female) (Per Cent) | Gender Parity Index For Total Net Enrolment Rate, Primary Education (Ratio Expressed) | Gender Parity Index For Total Net Enrolment Rate, Secondary Education (Ratio Expressed) | Gender Parity Index For Total Net Enrolment Rate, Lower Secondary Education (Ratio Expressed) | Gender Parity Index For Total Net Enrolment Rate, Upper Secondary Education (Ratio Expressed) | Pupil-Teacher Ratio In Primary School (Per Cent) | Pupil-Teacher Ratio In Secondary School (Per Cent) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | 99.1 | 98.9 | 99.2 | 99.6 | 99.7 | 99.5 | 100.0 | 99.9 | 100.0 | 98.3 | 98.5 | 98.0 | 1.0 | 1.0 | 1.00 | 0.99 | 14.0 | 11.9 |
| 2021 | 99.4 | 99.4 | 99.3 | 99.1 | 99.0 | 99.2 | 99.5 | 99.4 | 99.7 | 98.6 | 98.5 | 98.7 | 1.0 | 1.0 | 1.00 | 1.00 | 13.8 | 11.6 |
| 2020 | 99.9 | 99.8 | 100.0 | 99.0 | 99.2 | 98.8 | 99.1 | 99.2 | 99.0 | 98.9 | 99.2 | 98.6 | 1.0 | 1.0 | 1.00 | 0.99 | 13.9 | 11.5 |
| 2019 | 99.9 | 99.8 | 100.0 | 99.7 | 99.7 | 99.7 | 99.7 | 99.6 | 99.8 | 99.7 | 99.9 | 99.6 | 1.0 | 1.0 | 1.00 | 1.00 | 14.2 | 11.3 |
| 2018 | 99.8 | 99.7 | 99.9 | 99.8 | 100.0 | 99.7 | 99.4 | 99.8 | 99.1 | 100.0 | 99.9 | 100.0 | 1.0 | 1.0 | 0.99 | 1.00 | 14.3 | 11.3 |
| 2017 | 100.0 | 99.9 | 100.0 | 100.0 | 100.0 | 100.0 | 99.9 | 100.0 | 99.9 | 99.9 | 99.8 | 100.0 | 1.0 | 1.0 | 1.00 | 1.00 | 14.7 | 11.5 |
| 2016 | 100.0 | 100.0 | 100.0 | 100.0 | 99.9 | 100.0 | 99.9 | 100.0 | 99.9 | 99.9 | 99.8 | 100.0 | 1.0 | 1.0 | 1.00 | 1.00 | 15.1 | 11.7 |
M183401.csv contains all of the information related to educational attainment levels and their corresponding proportions within different age groups and genders in Singapore. This file provides a detailed breakdown of how different educational qualifications are distributed across various age groups and genders in Singapore, enabling analysis of trends in educational attainment within the population.
df5=load_and_clean_MXXXXXX('M183401', 10, 21)
df5
| Data Series | Total | Sex: Male | Sex: Female | Age Group (Years): Below 30 | Age Group (Years): 25 - 29 | Age Group (Years): 30 - 39 | Age Group (Years): 40 - 49 | Age Group (Years): 50 & Over | Age Group (Years): 50 - 59 | Age Group (Years): 60 & Over | Age Group (Years): 25 & Over | Age Group (Years): 15 - 24 | Sex: Male | Sex: Female | Highest Qualification Attained: Below Secondary | Highest Qualification Attained: Secondary | Highest Qualification Attained: Post-Secondary (Non-Tertiary) | Highest Qualification Attained: Diploma & Professional Qualification | Highest Qualification Attained: Degree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023 | 2.7 | 2.7 | 2.7 | 5.2 | 4.3 | 2.1 | 2.3 | 2.4 | 2.5 | 2.2 | 2.5 | 6.7 | 5.4 | 8.2 | 2.4 | 2.9 | 3.2 | 2.8 | 2.6 |
| 2022 | 2.9 | 2.8 | 3.0 | 4.7 | 4.0 | 2.3 | 2.5 | 2.8 | 2.7 | 2.9 | 2.7 | 5.9 | 4.0 | 8.2 | 2.5 | 3.1 | 3.4 | 3.0 | 2.8 |
| 2021 | 3.5 | 3.3 | 3.8 | 5.7 | 4.7 | 2.6 | 3.2 | 3.4 | 3.5 | 3.4 | 3.3 | 7.3 | 5.0 | 10.0 | 3.1 | 4.1 | 4.5 | 3.9 | 3.2 |
| 2020 | 4.1 | 3.9 | 4.4 | 7.5 | 5.5 | 3.0 | 3.4 | 3.9 | 3.9 | 3.7 | 3.7 | 10.6 | 8.1 | 13.5 | 4.1 | 4.6 | 4.9 | 4.7 | 3.5 |
| 2019 | 3.1 | 3.0 | 3.4 | 5.9 | 4.6 | 2.2 | 2.5 | 2.9 | 3.1 | 2.7 | 2.8 | 7.7 | 5.8 | 10.0 | 2.9 | 3.5 | 3.6 | 3.3 | 2.9 |
| 2018 | 2.9 | 2.9 | 3.0 | 5.2 | 4.1 | 2.2 | 2.3 | 2.6 | 2.7 | 2.5 | 2.6 | 6.6 | 5.0 | 8.6 | 2.5 | 2.8 | 3.4 | 3.1 | 2.9 |
| 2017 | 3.1 | 3.1 | 3.1 | 5.4 | 4.3 | 2.4 | 2.5 | 2.8 | 3.0 | 2.5 | 2.8 | 7.0 | 5.2 | 9.1 | 2.6 | 2.8 | 3.8 | 3.3 | 3.2 |
| 2016 | 3.0 | 3.0 | 3.0 | 5.0 | 3.8 | 2.3 | 2.4 | 2.7 | 2.7 | 2.6 | 2.7 | 6.5 | 4.6 | 8.8 | 2.4 | 3.1 | 3.2 | 3.1 | 3.1 |
| 2015 | 2.8 | 2.7 | 2.9 | 5.1 | 3.6 | 1.9 | 2.4 | 2.4 | 2.4 | 2.3 | 2.4 | 6.7 | 5.6 | 8.1 | 2.5 | 2.7 | 3.0 | 2.9 | 2.8 |
| 2014 | 2.7 | 2.7 | 2.9 | 5.2 | 4.1 | 2.2 | 2.1 | 2.3 | 2.2 | 2.4 | 2.4 | 6.4 | 4.9 | 8.1 | 2.5 | 2.8 | 2.7 | 2.9 | 2.9 |
| 2013 | 2.8 | 2.7 | 2.9 | 5.2 | 3.9 | 2.2 | 2.2 | 2.3 | 2.3 | 2.2 | 2.4 | 6.7 | 5.2 | 8.6 | 2.4 | 2.9 | 3.5 | 2.7 | 2.8 |
| 2012 | 2.8 | 2.7 | 3.1 | 5.1 | 3.5 | 2.2 | 2.1 | 2.4 | 2.2 | 2.8 | 2.4 | 6.7 | 5.3 | 8.2 | 2.6 | 2.9 | 3.5 | 2.8 | 2.7 |
| 2011 | 2.9 | 2.6 | 3.2 | 5.0 | 3.4 | 2.4 | 2.1 | 2.5 | 2.5 | 2.5 | 2.5 | 6.7 | 4.8 | 9.0 | 2.8 | 3.5 | 3.2 | 2.7 | 2.6 |
| 2010 | 3.1 | 3.0 | 3.4 | 5.5 | 3.9 | 2.3 | 2.6 | 2.7 | 2.6 | 2.9 | 2.7 | 7.4 | 5.5 | 9.6 | 3.4 | 3.3 | 3.8 | 2.9 | 2.8 |
| 2009 | 4.3 | 4.1 | 4.7 | 6.7 | 5.0 | 3.6 | 3.8 | 3.9 | 4.0 | 3.7 | 3.9 | 8.8 | 6.6 | 11.5 | 4.9 | 4.7 | 4.8 | 3.9 | 3.6 |
| 2008 | 3.2 | 3.0 | 3.5 | 5.2 | 3.7 | 2.6 | 2.7 | 2.9 | 3.1 | 2.3 | 2.8 | 7.0 | 5.4 | 8.9 | 3.9 | 3.3 | 3.7 | 2.8 | 2.7 |
| 2007 | 3.0 | 2.8 | 3.1 | 4.5 | 3.0 | 2.4 | 2.4 | 2.9 | 2.9 | 2.6 | 2.6 | 6.3 | 4.6 | 8.2 | 3.2 | 3.4 | 2.7 | 2.9 | 2.4 |
| 2006 | 3.6 | 3.4 | 3.7 | 5.4 | 3.8 | 3.0 | 3.0 | 3.4 | 3.5 | 2.8 | 3.2 | 7.1 | 6.1 | 8.3 | 4.2 | 3.9 | 3.5 | 3.2 | 2.8 |
| 2005 | 4.1 | 3.9 | 4.5 | 6.2 | 4.4 | 3.1 | 3.6 | 4.1 | 4.2 | 3.8 | 3.7 | 8.3 | 6.2 | 10.8 | 5.2 | 4.6 | 4.1 | 3.2 | 3.1 |
| 2004 | 4.4 | 4.5 | 4.4 | 6.7 | 5.1 | 3.5 | 3.8 | 4.3 | 4.6 | 3.0 | 4.0 | 8.6 | 6.0 | 11.6 | 5.2 | 4.6 | 3.8 | 4.2 | 3.6 |
| 2003 | 5.2 | 5.1 | 5.3 | 7.3 | 5.8 | 4.5 | 4.9 | 4.5 | 5.0 | 2.8 | 4.8 | 9.3 | 6.7 | 12.3 | 6.3 | 5.3 | 4.2 | 4.7 | 4.4 |
| 2002 | 4.8 | 4.8 | 4.8 | 6.6 | 5.0 | 3.9 | 4.6 | 4.4 | 4.8 | 3.3 | 4.4 | 8.8 | 5.5 | 12.3 | 5.8 | 4.8 | 3.8 | 4.1 | 4.1 |
| 2001 | 3.7 | 3.9 | 3.4 | 5.1 | 4.3 | 3.1 | 3.4 | 3.5 | 3.8 | 2.3 | 3.4 | 6.2 | 4.6 | 7.8 | 4.7 | 3.5 | 2.9 | 3.4 | 2.9 |
| 2000 | 3.7 | 3.9 | 3.5 | 4.6 | 3.6 | 3.1 | 3.7 | 3.4 | 3.6 | 2.7 | 3.4 | 5.9 | 4.7 | 7.1 | 5.2 | 3.6 | 2.4 | 2.4 | 2.6 |
| 1999 | 3.8 | 3.8 | 3.8 | 5.2 | 4 | 3.2 | 3.4 | 3.4 | 3.7 | 2.7 | 3.4 | 6.8 | na | na | 4.8 | 3.6 | 2.9 | 2.6 | 3.1 |
| 1998 | 3.5 | 3.5 | 3.4 | 5.1 | 3.8 | 2.8 | 3.2 | 2.7 | 3.3 | 1 | 3.1 | 6.7 | na | na | 4.3 | 3.4 | 1.9 | 2.9 | 2.8 |
| 1997 | 2 | 2 | 1.9 | 3.2 | 2.3 | 1.5 | 1.6 | 1.5 | 1.5 | 1.4 | 1.7 | 4.3 | na | na | 2.4 | 1.7 | 1.6 | 1.8 | 1.7 |
| 1996 | 2.2 | 2.3 | 2 | 3.4 | 2.7 | 1.7 | 1.6 | 1.9 | 2.1 | 1.3 | 1.9 | 4.2 | na | na | 2.5 | 2 | 1.5 | 1.7 | 2.6 |
| 1995 | 2.2 | 2.3 | 2.1 | 3.6 | 2.5 | 1.7 | 1.6 | 1.8 | 2 | 1.2 | 1.8 | 4.8 | na | na | 2.7 | 2 | 1.5 | 1.9 | 2.1 |
| 1994 | 2.2 | 2.2 | 2.1 | 3.6 | 2.4 | 1.6 | 1.5 | 1.4 | 1.6 | 1 | 1.7 | 4.8 | na | na | 2.5 | 1.8 | 1.4 | 2 | 2.6 |
| 1993 | 2.1 | 2.1 | 2 | 3.3 | 2.3 | 1.6 | 1.4 | 1.5 | 1.6 | 1.2 | 1.7 | 4.3 | na | na | 2.4 | 2 | 1.2 | 1.7 | 2.3 |
| 1992 | 2.2 | 2 | 2.5 | 3.3 | 2 | 1.7 | 1.7 | 1.3 | 1.4 | 1.2 | 1.7 | 4.6 | na | na | 2.5 | 2 | 1.6 | 1.5 | 2.2 |
M015651.csv contains all of the information related to the Gross Domestic Product (GDP) of Singapore, broken down by industry sector, including goods-producing industries and services-producing industries. This file provides a detailed breakdown of GDP contributions by various sectors of the Singapore economy, enabling analysis of economic performance across different industries.
df6=load_and_clean_MXXXXXX('M015651', 10, 31)
df6
| Data Series | GDP At Current Market Prices | Goods Producing Industries | Manufacturing | Construction | Utilities | Other Goods Industries | Services Producing Industries | Wholesale & Retail Trade | Wholesale Trade | Retail Trade | ... | Administrative & Support Services | Other Services Industries | Public Administration & Defence | Education | Health & Social Services | Arts, Entertainment & Recreation | Other Services - Others | Ownership Of Dwellings | Gross Value Added At Basic Prices | Add: Taxes On Products |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024 2Q | 180428.1 | 39146.9 | 31050.7 | 5708.6 | 2340.9 | 46.7 | 125263.6 | 40108.1 | 38084.5 | 2023.6 | ... | 4532.7 | 14774.4 | 3283.2 | 3724.3 | 4456.0 | 1521.3 | 1789.6 | 6581.6 | 170992.1 | 9436.0 |
| 2024 1Q | 172917.4 | 38286.4 | 30079.4 | 5865.9 | 2292.4 | 48.7 | 119386.3 | 34139.4 | 32036.1 | 2103.3 | ... | 4753.1 | 17496.0 | 4993.1 | 4597.6 | 4174.2 | 1904.7 | 1826.4 | 6360.5 | 164033.2 | 8884.2 |
| 2023 4Q | 176412.1 | 36945.0 | 28355.3 | 6112.7 | 2423.6 | 53.4 | 124243.1 | 44710.6 | 42513.2 | 2197.4 | ... | 4639.9 | 16168.3 | 4330.8 | 4202.1 | 4134.7 | 1723.6 | 1777.1 | 6079.8 | 167267.9 | 9144.2 |
| 2023 3Q | 168373.8 | 37946.2 | 29832.9 | 5600.7 | 2460.8 | 51.8 | 115633.9 | 38452.0 | 36420.8 | 2031.2 | ... | 4427.5 | 14778.4 | 3463.6 | 3869.1 | 4138.6 | 1641.6 | 1665.5 | 5898.9 | 159479.0 | 8894.8 |
| 2023 2Q | 167351.6 | 38067.5 | 30281.6 | 5311.4 | 2429.2 | 45.3 | 115418.3 | 38151.6 | 36136.2 | 2015.4 | ... | 4382.1 | 13839.0 | 3157.0 | 3576.8 | 4065.9 | 1368.3 | 1671.0 | 5729.5 | 159215.3 | 8136.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1976 1Q | 3545.1 | 1166.6 | 744.8 | 271.7 | 73.4 | 76.7 | 2154.2 | 677.5 | na | na | ... | na | 357.1 | na | na | na | na | na | 88.6 | 3409.4 | 135.7 |
| 1975 4Q | 3693 | 1269.6 | 810.1 | 294.9 | 83.2 | 81.4 | 2206.6 | 724.8 | na | na | ... | na | 402.4 | na | na | na | na | na | 86.7 | 3562.9 | 130.1 |
| 1975 3Q | 3519.3 | 1205.7 | 785.1 | 272.6 | 73.5 | 74.5 | 2105.2 | 716 | na | na | ... | na | 344.7 | na | na | na | na | na | 86 | 3396.9 | 122.4 |
| 1975 2Q | 3307 | 1090.9 | 692.9 | 256.3 | 71.2 | 70.5 | 2019.6 | 704.9 | na | na | ... | na | 324.1 | na | na | na | na | na | 83.5 | 3194 | 113 |
| 1975 1Q | 3209.4 | 1014.2 | 654.9 | 228.5 | 61.3 | 69.5 | 1991.7 | 672.8 | na | na | ... | na | 328.1 | na | na | na | na | na | 81.2 | 3087.1 | 122.3 |
198 rows × 29 columns
17892.csv contains all of the information related to the Gini coefficient of Singapore, specifically focusing on household income inequality before and after accounting for government transfers and taxes. This file provides insights into how government policies such as taxes impact income inequality in Singapore, as reflected by changes in the Gini coefficient.
df7=load_and_clean_MXXXXXX('17892', 10, 27)
df7
| Before / After Accounting for Government Transfers and Taxes | Household Income from Work Per Household Member (Including Employer CPF Contributions) | Household Income from Work Per Household Member (Including Employer CPF Contributions) After Accounting for Government Transfers and Taxes1/ |
|---|---|---|
| 2023 | 0.433 | 0.371 |
| 2022 | 0.437 | 0.378 |
| 2021 | 0.444 | 0.385 |
| 2020 | 0.452 | 0.375 |
| 2019 | 0.452 | 0.398 |
| 2018 | 0.458 | 0.403 |
| 2017 | 0.459 | 0.402 |
| 2016 | 0.458 | 0.401 |
| 2015 | 0.463 | 0.409 |
| 2014 | 0.464 | 0.411 |
| 2013 | 0.463 | 0.409 |
| 2012 | 0.478 | 0.432 |
| 2011 | 0.473 | 0.423 |
| 2010 | 0.472 | 0.425 |
| 2009 | 0.471 | 0.422 |
| 2008 | 0.474 | 0.424 |
| 2007 | 0.482 | 0.439 |
| 2006 | 0.470 | 0.418 |
| 2005 | 0.465 | 0.422 |
| 2004 | 0.460 | 0.419 |
| 2003 | 0.457 | 0.422 |
| 2002 | 0.454 | 0.414 |
| 2001 | 0.454 | 0.419 |
| 2000 | 0.442 | 0.414 |
| Unnamed: 25 | NaN | NaN |
The below file contains all of the information related to various socioeconomic indicators, across regions and time.
df8=pd.read_csv('./data/DEMO_DS_23082024021945252.csv')
df8
| DEMO_IND | Indicator | LOCATION | Country | TIME | Time | Value | Flag Codes | Flags | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NY_GDP_MKTP_CN | GDP (current LCU) | AUS | Australia | 2017 | 2017 | 1.758828e+12 | NaN | NaN |
| 1 | NY_GDP_MKTP_CN | GDP (current LCU) | AUS | Australia | 2018 | 2018 | 1.842635e+12 | NaN | NaN |
| 2 | NY_GDP_MKTP_CN | GDP (current LCU) | AUS | Australia | 2019 | 2019 | 1.946613e+12 | NaN | NaN |
| 3 | NY_GDP_MKTP_CN | GDP (current LCU) | AUS | Australia | 2020 | 2020 | 1.979471e+12 | NaN | NaN |
| 4 | NY_GDP_MKTP_CN | GDP (current LCU) | AUS | Australia | 2021 | 2021 | 2.080419e+12 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27109 | NY_GDP_DEFL_ZS | GDP deflator (base year varies by country) | FRO | Faeroe Islands | 2017 | 2017 | 1.142000e+02 | NaN | NaN |
| 27110 | NY_GDP_DEFL_ZS | GDP deflator (base year varies by country) | FRO | Faeroe Islands | 2018 | 2018 | 1.138000e+02 | NaN | NaN |
| 27111 | NY_GDP_DEFL_ZS | GDP deflator (base year varies by country) | FRO | Faeroe Islands | 2019 | 2019 | 1.176000e+02 | NaN | NaN |
| 27112 | NY_GDP_DEFL_ZS | GDP deflator (base year varies by country) | FRO | Faeroe Islands | 2020 | 2020 | 1.175000e+02 | NaN | NaN |
| 27113 | NY_GDP_DEFL_ZS | GDP deflator (base year varies by country) | FRO | Faeroe Islands | 2021 | 2021 | 1.201000e+02 | NaN | NaN |
27114 rows × 9 columns
The below file contains all of the information related to Government expenditure on education, across various regions and time.
df9=pd.read_csv('./data/NATMON_DS_23082024022406407.csv')
df9
| NATMON_IND | Indicator | LOCATION | Country | TIME | Time | Value | Flag Codes | Flags | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | X_US_1_FSGOV | Government expenditure on primary education, U... | UKR | Ukraine | 2017 | 2017 | 1316.29505 | NaN | NaN |
| 1 | X_US_1_FSGOV | Government expenditure on primary education, U... | UKR | Ukraine | 2018 | 2018 | 1595.33510 | NaN | NaN |
| 2 | X_US_1_FSGOV | Government expenditure on primary education, U... | UKR | Ukraine | 2019 | 2019 | 1857.26190 | NaN | NaN |
| 3 | X_US_1_FSGOV | Government expenditure on primary education, U... | UKR | Ukraine | 2020 | 2020 | 1840.77686 | NaN | NaN |
| 4 | X_US_5T8_FSGOV | Government expenditure on tertiary education, ... | MLI | Mali | 2017 | 2017 | 100.66726 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3776 | X_US_2T3_FSGOV | Government expenditure on secondary education,... | UZB | Uzbekistan | 2020 | 2020 | 1116.55417 | NaN | NaN |
| 3777 | X_US_2T3_FSGOV | Government expenditure on secondary education,... | UZB | Uzbekistan | 2021 | 2021 | 1050.16985 | NaN | NaN |
| 3778 | X_US_2T4_V_FSGOV | Government expenditure on secondary and post-s... | USA | United States of America | 2020 | 2020 | 5044.48273 | NaN | NaN |
| 3779 | X_US_2T4_V_FSGOV | Government expenditure on secondary and post-s... | ARG | Argentina | 2020 | 2020 | NaN | a | Category not applicable |
| 3780 | X_US_2T4_V_FSGOV | Government expenditure on secondary and post-s... | ARG | Argentina | 2021 | 2021 | NaN | a | Category not applicable |
3781 rows × 9 columns
The below file contains all of the information related to 'happiness' (which we will take as an indicator of mental health), and time
df10=pd.read_csv('./data/DataForTable2 (1).csv', encoding='unicode_escape')
df10
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.350 | 0.451 | 50.500 | 0.718 | 0.164 | 0.882 | 0.414 | 0.258 |
| 1 | Afghanistan | 2009 | 4.402 | 7.509 | 0.552 | 50.800 | 0.679 | 0.187 | 0.850 | 0.481 | 0.237 |
| 2 | Afghanistan | 2010 | 4.758 | 7.614 | 0.539 | 51.100 | 0.600 | 0.118 | 0.707 | 0.517 | 0.275 |
| 3 | Afghanistan | 2011 | 3.832 | 7.581 | 0.521 | 51.400 | 0.496 | 0.160 | 0.731 | 0.480 | 0.267 |
| 4 | Afghanistan | 2012 | 3.783 | 7.661 | 0.521 | 51.700 | 0.531 | 0.234 | 0.776 | 0.614 | 0.268 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2358 | Zimbabwe | 2019 | 2.694 | 7.698 | 0.759 | 53.100 | 0.632 | -0.051 | 0.831 | 0.658 | 0.235 |
| 2359 | Zimbabwe | 2020 | 3.160 | 7.596 | 0.717 | 53.575 | 0.643 | 0.003 | 0.789 | 0.661 | 0.346 |
| 2360 | Zimbabwe | 2021 | 3.155 | 7.657 | 0.685 | 54.050 | 0.668 | -0.079 | 0.757 | 0.610 | 0.242 |
| 2361 | Zimbabwe | 2022 | 3.296 | 7.670 | 0.666 | 54.525 | 0.652 | -0.073 | 0.753 | 0.641 | 0.191 |
| 2362 | Zimbabwe | 2023 | 3.572 | 7.679 | 0.694 | 55.000 | 0.735 | -0.069 | 0.757 | 0.610 | 0.179 |
2363 rows × 11 columns
The below file contains all of the information related to governmental expenditure on education as a percentage of gdp
df11=pd.read_csv('./data/API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_68.csv', skiprows=4)
df11
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.887830 | 5.49106 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.750985 | 4.88207 | 4.820445 | 4.73975 | 4.511475 | 4.35244 | 4.67441 | 4.24412 | NaN | NaN |
| 2 | Afghanistan | AFG | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.255800 | 4.54397 | 4.343190 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.195490 | 2.78355 | 3.535590 | 3.07200 | 3.100000 | 3.40400 | 3.28288 | 2.92850 | NaN | NaN |
| 4 | Angola | AGO | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.100000 | 2.75500 | 2.467000 | 2.04500 | 2.073000 | 2.74500 | 2.29700 | 2.33200 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.482850 | 5.44424 | 5.598670 | 5.64401 | 5.927700 | 6.18336 | 6.56206 | 6.18348 | 6.604 | NaN |
| 264 | Zambia | ZMB | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.624330 | 3.74792 | 3.729640 | 4.73975 | 4.418000 | 3.95000 | 3.10600 | 3.58300 | NaN | NaN |
| 265 | Zimbabwe | ZWE | Government expenditure on education, total (% ... | SE.XPD.TOTL.GD.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.813000 | 5.47300 | 5.818780 | 2.05049 | NaN | NaN | NaN | NaN | NaN | NaN |
266 rows × 69 columns
The below file contains all of the information related to global gdp growth (annual %)
df12=pd.read_csv('./data/API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_59.csv', skiprows=4)
df12
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | NaN | NaN | NaN | NaN | NaN | ... | -0.623626 | 1.719625 | 7.048533 | 2.381730 | -2.302836 | -23.982581 | 27.639357 | 10.458317 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | 0.460106 | 7.868013 | 5.616400 | 4.668135 | 5.138990 | ... | 3.011572 | 2.183581 | 2.651400 | 2.681718 | 2.189259 | -2.780379 | 4.497380 | 3.554422 | 2.622310 | NaN |
| 2 | Afghanistan | AFG | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.451315 | 2.260314 | 2.647003 | 1.189228 | 3.911603 | -2.351101 | -20.738839 | -6.240172 | NaN | NaN |
| 3 | Africa Western and Central | AFW | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | 1.873455 | 3.707643 | 7.145784 | 5.406403 | 4.102491 | ... | 2.813979 | 0.141489 | 2.284961 | 2.842645 | 3.222029 | -1.004417 | 3.994854 | 3.784486 | 3.342776 | NaN |
| 4 | Angola | AGO | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.943572 | -2.580111 | -0.147150 | -1.316362 | -0.702273 | -5.638215 | 1.199211 | 3.045403 | 0.856001 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.916231 | 5.571775 | 4.825656 | 3.406632 | 4.756801 | -5.340275 | 10.745656 | 4.278499 | 3.337895 | NaN |
| 262 | Yemen, Rep. | YEM | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | NaN | NaN | NaN | NaN | NaN | ... | -27.994546 | -9.375124 | -5.071796 | 0.752448 | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | 3.844734 | 6.177931 | 7.373709 | 7.939609 | 6.122798 | ... | 1.321862 | 0.664552 | 1.157947 | 1.556784 | 0.259936 | -5.963358 | 4.703062 | 1.910406 | 0.601662 | NaN |
| 264 | Zambia | ZMB | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | 1.361382 | -2.490839 | 3.272393 | 12.214048 | 16.647456 | ... | 2.920375 | 3.755100 | 3.525863 | 4.034494 | 1.441306 | -2.785055 | 6.234922 | 5.249622 | 5.832474 | NaN |
| 265 | Zimbabwe | ZWE | GDP growth (annual %) | NY.GDP.MKTP.KD.ZG | NaN | 6.316157 | 1.434471 | 6.244345 | -1.106172 | 4.910571 | ... | 2.023650 | 0.900955 | 4.080264 | 5.009867 | -6.332446 | -7.816951 | 8.468017 | 6.522375 | 4.955782 | NaN |
266 rows × 69 columns
This dataset covers info about Global gdp in Current US$
df13=pd.read_csv('./data/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_31795.csv', skiprows=4)
df13
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.962907e+09 | 2.983635e+09 | 3.092429e+09 | 3.276184e+09 | 3.395799e+09 | 2.558906e+09 | 3.103184e+09 | 3.544708e+09 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP (current US$) | NY.GDP.MKTP.CD | 2.121696e+10 | 2.230747e+10 | 2.370247e+10 | 2.577938e+10 | 2.804954e+10 | 3.037491e+10 | ... | 8.992957e+11 | 8.298300e+11 | 9.401055e+11 | 1.012719e+12 | 1.006527e+12 | 9.290741e+11 | 1.086772e+12 | 1.183962e+12 | 1.236163e+12 | NaN |
| 2 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.913422e+10 | 1.811657e+10 | 1.875346e+10 | 1.805322e+10 | 1.879944e+10 | 1.995593e+10 | 1.426650e+10 | 1.450216e+10 | NaN | NaN |
| 3 | Africa Western and Central | AFW | GDP (current US$) | NY.GDP.MKTP.CD | 1.188413e+10 | 1.268566e+10 | 1.360683e+10 | 1.443998e+10 | 1.576911e+10 | 1.693448e+10 | ... | 7.693673e+11 | 6.921811e+11 | 6.857502e+11 | 7.681896e+11 | 8.239336e+11 | 7.871467e+11 | 8.459930e+11 | 8.771408e+11 | 7.965862e+11 | NaN |
| 4 | Angola | AGO | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 9.049642e+10 | 5.276162e+10 | 7.369015e+10 | 7.945069e+10 | 7.089796e+10 | 4.850156e+10 | 6.650513e+10 | 1.043997e+11 | 8.472296e+10 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.295848e+09 | 6.682677e+09 | 7.180765e+09 | 7.878760e+09 | 7.899738e+09 | 7.717145e+09 | 9.413404e+09 | 9.354903e+09 | 1.043835e+10 | NaN |
| 262 | Yemen, Rep. | YEM | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.244449e+10 | 3.131783e+10 | 2.684223e+10 | 2.160616e+10 | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | GDP (current US$) | NY.GDP.MKTP.CD | 8.748597e+09 | 9.225996e+09 | 9.813996e+09 | 1.085420e+10 | 1.195600e+10 | 1.306899e+10 | ... | 3.467098e+11 | 3.235855e+11 | 3.814488e+11 | 4.052607e+11 | 3.893300e+11 | 3.382914e+11 | 4.201178e+11 | 4.052709e+11 | 3.777816e+11 | NaN |
| 264 | Zambia | ZMB | GDP (current US$) | NY.GDP.MKTP.CD | 7.130000e+08 | 6.962857e+08 | 6.931429e+08 | 7.187143e+08 | 8.394286e+08 | 1.082857e+09 | ... | 2.125122e+10 | 2.095841e+10 | 2.587360e+10 | 2.631151e+10 | 2.330867e+10 | 1.813776e+10 | 2.209642e+10 | 2.916378e+10 | 2.816263e+10 | NaN |
| 265 | Zimbabwe | ZWE | GDP (current US$) | NY.GDP.MKTP.CD | 1.052990e+09 | 1.096647e+09 | 1.117602e+09 | 1.159512e+09 | 1.217138e+09 | 1.311436e+09 | ... | 1.996312e+10 | 2.054868e+10 | 1.758489e+10 | 3.415607e+10 | 2.183223e+10 | 2.150970e+10 | 2.837124e+10 | 2.736663e+10 | 2.653827e+10 | NaN |
266 rows × 69 columns
This dataset covers info about Population
df14=pd.read_csv('./data/API_SP.POP.TOTL_DS2_en_csv_v2_31753.csv', skiprows=4)
df14
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Population, total | SP.POP.TOTL | 54608.0 | 55811.0 | 56682.0 | 57475.0 | 58178.0 | 58782.0 | ... | 104257.0 | 104874.0 | 105439.0 | 105962.0 | 106442.0 | 106585.0 | 106537.0 | 106445.0 | 106277.0 | NaN |
| 1 | Africa Eastern and Southern | AFE | Population, total | SP.POP.TOTL | 130692579.0 | 134169237.0 | 137835590.0 | 141630546.0 | 145605995.0 | 149742351.0 | ... | 600008424.0 | 616377605.0 | 632746570.0 | 649757148.0 | 667242986.0 | 685112979.0 | 702977106.0 | 720859132.0 | 739108306.0 | NaN |
| 2 | Afghanistan | AFG | Population, total | SP.POP.TOTL | 8622466.0 | 8790140.0 | 8969047.0 | 9157465.0 | 9355514.0 | 9565147.0 | ... | 33753499.0 | 34636207.0 | 35643418.0 | 36686784.0 | 37769499.0 | 38972230.0 | 40099462.0 | 41128771.0 | 42239854.0 | NaN |
| 3 | Africa Western and Central | AFW | Population, total | SP.POP.TOTL | 97256290.0 | 99314028.0 | 101445032.0 | 103667517.0 | 105959979.0 | 108336203.0 | ... | 408690375.0 | 419778384.0 | 431138704.0 | 442646825.0 | 454306063.0 | 466189102.0 | 478185907.0 | 490330870.0 | 502789511.0 | NaN |
| 4 | Angola | AGO | Population, total | SP.POP.TOTL | 5357195.0 | 5441333.0 | 5521400.0 | 5599827.0 | 5673199.0 | 5736582.0 | ... | 28127721.0 | 29154746.0 | 30208628.0 | 31273533.0 | 32353588.0 | 33428486.0 | 34503774.0 | 35588987.0 | 36684202.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Population, total | SP.POP.TOTL | 990150.0 | 1014211.0 | 1038618.0 | 1063175.0 | 1087700.0 | 1111812.0 | ... | 1788196.0 | 1777557.0 | 1791003.0 | 1797085.0 | 1788878.0 | 1790133.0 | 1786038.0 | 1768086.0 | 1756374.0 | NaN |
| 262 | Yemen, Rep. | YEM | Population, total | SP.POP.TOTL | 5542459.0 | 5646668.0 | 5753386.0 | 5860197.0 | 5973803.0 | 6097298.0 | ... | 28516545.0 | 29274002.0 | 30034389.0 | 30790513.0 | 31546691.0 | 32284046.0 | 32981641.0 | 33696614.0 | 34449825.0 | NaN |
| 263 | South Africa | ZAF | Population, total | SP.POP.TOTL | 16520441.0 | 16989464.0 | 17503133.0 | 18042215.0 | 18603097.0 | 19187194.0 | ... | 55876504.0 | 56422274.0 | 56641209.0 | 57339635.0 | 58087055.0 | 58801927.0 | 59392255.0 | 59893885.0 | 60414495.0 | NaN |
| 264 | Zambia | ZMB | Population, total | SP.POP.TOTL | 3119430.0 | 3219451.0 | 3323427.0 | 3431381.0 | 3542764.0 | 3658024.0 | ... | 16248230.0 | 16767761.0 | 17298054.0 | 17835893.0 | 18380477.0 | 18927715.0 | 19473125.0 | 20017675.0 | 20569737.0 | NaN |
| 265 | Zimbabwe | ZWE | Population, total | SP.POP.TOTL | 3806310.0 | 3925952.0 | 4049778.0 | 4177931.0 | 4310332.0 | 4447149.0 | ... | 14154937.0 | 14452704.0 | 14751101.0 | 15052184.0 | 15354608.0 | 15669666.0 | 15993524.0 | 16320537.0 | 16665409.0 | NaN |
266 rows × 69 columns
This dataset covers Global Gini Indexes.
df16 = pd.read_csv('./data/API_SI.POV.GINI_DS2_en_csv_v2_31732.csv', skiprows=4)
df16
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Angola | AGO | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 51.3 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | 26.5 | 26.7 | 29.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 264 | Zambia | ZMB | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | 55.8 | NaN | NaN | NaN | NaN | NaN | NaN | 51.5 | NaN | NaN |
| 265 | Zimbabwe | ZWE | Gini index | SI.POV.GINI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 44.3 | NaN | 50.3 | NaN | NaN | NaN | NaN | NaN |
266 rows × 69 columns
This dataset covers Global Income share held by highest 10%.
df17 = pd.read_csv('./data/API_SI.DST.10TH.10_DS2_en_csv_v2_12274.csv', skiprows=4)
df17
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Angola | AGO | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 39.6 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 21.5 | 21.8 | 24.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 264 | Zambia | ZMB | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 43.4 | NaN | NaN | NaN | NaN | NaN | NaN | 39.1 | NaN | NaN |
| 265 | Zimbabwe | ZWE | Income share held by highest 10% | SI.DST.10TH.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 34.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
266 rows × 69 columns
This dataset covers Income share held by highest 20%
df18 = pd.read_csv('./data/API_SI.DST.05TH.20_DS2_en_csv_v2_9120.csv', skiprows=4)
df18
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Angola | AGO | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 55.6 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 36.1 | 36.3 | 38.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 264 | Zambia | ZMB | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 60.2 | NaN | NaN | NaN | NaN | NaN | NaN | 56.4 | NaN | NaN |
| 265 | Zimbabwe | ZWE | Income share held by highest 20% | SI.DST.05TH.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 51.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
266 rows × 69 columns
This code covers Income share held by lowest 20%
df19 = pd.read_csv('./data/API_SI.DST.FRST.20_DS2_en_csv_v2_3154.csv', skiprows=4)
df19
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Angola | AGO | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 3.8 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 9.5 | 9.4 | 9.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 264 | Zambia | ZMB | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.1 | NaN | NaN | NaN | NaN | NaN | NaN | 3.9 | NaN | NaN |
| 265 | Zimbabwe | ZWE | Income share held by lowest 20% | SI.DST.FRST.20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 6.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
266 rows × 69 columns
This dataset covers Income share held by lowest 10%
df20 = pd.read_csv('./data/API_SI.DST.FRST.10_DS2_en_csv_v2_12272.csv', skiprows=4)
df20
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Angola | AGO | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1.3 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | 3.8 | 3.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 263 | South Africa | ZAF | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 264 | Zambia | ZMB | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.2 | NaN | NaN | NaN | NaN | NaN | NaN | 1.5 | NaN | NaN |
| 265 | Zimbabwe | ZWE | Income share held by lowest 10% | SI.DST.FRST.10 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 2.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
266 rows × 69 columns
https://www.iban.com/country-codes249 rows × 4 columnsM850011.csv, 63 rows × 12 columnsM850581.csv, 34 rows × 180 columnsM850671.csv, 15 rows × 7 columnsM850691.csv, 7 rows × 18 columnsM183401.csv, 32 rows × 19 columnsM015651.csv, 198 rows × 29 columns17892.csv, 25 rows × 2 columnsDEMO_DS_23082024021945252.csv, 27114 rows × 9 columnsNATMON_DS_23082024022406407.csv, 3781 rows × 9 columnsDataForTable2 (1).csv, 2363 rows × 11 columnsAPI_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_68.csv, 266 rows × 69 columnsAPI_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_59.csv, 266 rows × 69 columnsAPI_NY.GDP.MKTP.CD_DS2_en_csv_v2_31795.csv, 266 rows × 69 columnsAPI_SP.POP.TOTL_DS2_en_csv_v2_31753.csv, 266 rows × 69 columnsAPI_SI.POV.GINI_DS2_en_csv_v2_31732.csv, 266 rows × 69 columnsAPI_SI.DST.10TH.10_DS2_en_csv_v2_12274.csv, 266 rows × 69 columnsAPI_SI.DST.05TH.20_DS2_en_csv_v2_9120.csv, 266 rows × 69 columnsAPI_SI.DST.FRST.20_DS2_en_csv_v2_3154.csv, 266 rows × 69 columnsAPI_SI.DST.FRST.10_DS2_en_csv_v2_12272.csv, 266 rows × 69 columns
Do not manually perform any data cleaning steps — all data cleaning should be done programmatically, by Python code that you write. That is, you should not be searching in csv files in excel and filling in empty boxes or changing strings to numbers etc.
For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any).
If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.
Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.
In this section, we answer the question, What is the correlation between government expenditure on education and GDP growth?
Firstly, we will select the appropriate data needed to answer the question. In Exploratory Analysis, we attempt to gain insight into the question by analysing appropriate graphs and figures. This will help us gain a better understanding of the question, as well as the relevant data, before we plunge into Quantitative Analysis.
In Quantitative Analysis, we tackle the question head-on, and gain real figures, such as correlations, etc. that will help us answer the question.
We will be using the following 6 datasets for this question.
https://tablebuilder.singstat.gov.sg/table/TS/M850011#
- (SG) Government Expenditure On Education, from 1960 - 2022
https://tablebuilder.singstat.gov.sg/table/TS/M015651
- (SG) Gross Domestic Product At Current Prices, By Industry (SSIC 2020), from 1975 - 2024
https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?_gl=1
- (Global), Government expenditure on education, total (% of GDP)
https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?_gl=1
- (Global), GDP growth (annual %)
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
- (Global), GDP (Current US$)
https://data.worldbank.org/indicator/SP.POP.TOTL?name_desc=false
- (Global), Total Population
Firstly, we will clean the datasets. To maintain uniformity, we will convert everything to USD, where applicable
# Extracting the 'Total Government Expenditure On Education' column and converting values to float
# Multiply by 1000 to convert from thousand dollars, and by 0.78 to adjust for a certain factor (e.g., inflation or scaling).
sg_edu_expense = df1[["Total Government Expenditure On Education (Thousand Dollars)"]].astype(float) * 1000 * 0.78
# Renaming the column to 'Expenditure On Education' for clarity
sg_edu_expense.rename(columns={"Total Government Expenditure On Education (Thousand Dollars)": "Expenditure On Education"}, inplace=True)
# Converting the index to a datetime format, assuming the index represents years and removing any whitespace
sg_edu_expense.index = pd.to_datetime(sg_edu_expense.index.str.strip(), format='%Y')
# Displaying the first 5 rows of the modified DataFrame
display(sg_edu_expense.head(5))
# Providing information about the DataFrame's structure and data types
sg_edu_expense.info()
| Data Series | Expenditure On Education |
|---|---|
| 2022-01-01 | 1.018721e+10 |
| 2021-01-01 | 1.006973e+10 |
| 2020-01-01 | 9.562685e+09 |
| 2019-01-01 | 9.924155e+09 |
| 2018-01-01 | 1.004327e+10 |
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 63 entries, 2022-01-01 to 1960-01-01 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Expenditure On Education 63 non-null float64 dtypes: float64(1) memory usage: 1008.0 bytes
# Extracting the year part from the index, converting it to float for time calculations
years = df6.index.str[:4].astype(float)
# Extracting the quarter part from the index and mapping it to decimal values (1Q = 0.25, 2Q = 0.50, etc.)
quarters = df6.index.str[5:7].map({'1Q': 0.25, '2Q': 0.50, '3Q': 0.75, '4Q': 1.00}).astype(float)
# Combining years and quarters to get a continuous time variable
time = years + quarters
# Selecting the 'GDP At Current Market Prices' column, converting values to float, and renaming the column to 'GDP'
sg_gdp = df6[['GDP At Current Market Prices']].astype(float).rename(columns={"GDP At Current Market Prices": "GDP"})
# Setting the new 'time' variable as the index for the GDP DataFrame
sg_gdp.index = time
# Scaling GDP values (converting to millions and adjusting by 0.78, perhaps for inflation or other adjustment)
sg_gdp *= 1e6 * 0.78
# Adjusting GDP values by adding the shifted values (lagged by 1, 2, and 3 periods) to account for past GDP values
sg_gdp['GDP'] += sg_gdp['GDP'].shift(1) + sg_gdp['GDP'].shift(2) + sg_gdp['GDP'].shift(3)
# Filtering the DataFrame to include only the rows where the index (time) represents a whole year (e.g., 2000.00, 2001.00)
sg_gdp = sg_gdp[sg_gdp.index % 1 == 0]
# Dropping rows with any missing values (NaNs) to ensure data cleanliness
sg_gdp.dropna(axis=0, inplace=True)
# Converting the index (year) back to datetime format for easier time-based plotting or analysis
sg_gdp.index = pd.to_datetime(sg_gdp.index.astype(int).astype(str))
# Displaying the first 5 rows of the cleaned GDP DataFrame
display(sg_gdp.head(5))
# Displaying information about the structure of the GDP DataFrame
sg_gdp.info()
| Data Series | GDP |
|---|---|
| 2023-01-01 | 5.217365e+11 |
| 2022-01-01 | 5.244607e+11 |
| 2021-01-01 | 4.320549e+11 |
| 2020-01-01 | 3.789621e+11 |
| 2019-01-01 | 4.028781e+11 |
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 48 entries, 2023-01-01 to 1976-01-01 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 GDP 48 non-null float64 dtypes: float64(1) memory usage: 768.0 bytes
The above 2 dataframes have no null values. Therefore, we do not need to fill in any missing values. Let us proceed.
This converts a DataFrame of country codes and names into a dictionary where the country codes map to their corresponding country names.
# Selecting the first two columns from df11, assuming the first column contains country codes and the second contains country names
ISOcode_to_name = df11.iloc[:, :2]
# Setting the 'Country Code' column as the index of the DataFrame
ISOcode_to_name.set_index('Country Code', inplace=True)
# Removing the index name ('Country Code') for a cleaner DataFrame
ISOcode_to_name.index.name = None
# Displaying the first few rows of the DataFrame to check the transformation
display(ISOcode_to_name)
# Converting the DataFrame into a dictionary where the index (Country Code) is the key and 'Country Name' is the value
ISOcode_to_name = ISOcode_to_name.to_dict()
# Extracting the 'Country Name' dictionary from the above transformation
ISOcode_to_name = ISOcode_to_name['Country Name']
| Country Name | |
|---|---|
| ABW | Aruba |
| AFE | Africa Eastern and Southern |
| AFG | Afghanistan |
| AFW | Africa Western and Central |
| AGO | Angola |
| ... | ... |
| XKX | Kosovo |
| YEM | Yemen, Rep. |
| ZAF | South Africa |
| ZMB | Zambia |
| ZWE | Zimbabwe |
266 rows × 1 columns
This code cleans the global datasets by removing unnecessary columns, filtering rows, transposing the data for better usability, and finally displaying the first few rows of each cleaned DataFrame.
# Making copies of the original DataFrames to avoid modifying the original data
global_edu_expense_percent_gdp = df11.copy()
global_gdp_growth_percent = df12.copy()
global_gdp = df13.copy()
global_pop = df14.copy()
# Function to clean the DataFrames by removing unnecessary columns, setting index, and filtering rows
def clean_df(df):
# Dropping irrelevant columns ('Country Name', 'Indicator Name', 'Indicator Code', 'Unnamed: 68')
df.drop(columns=['Country Name', 'Indicator Name', 'Indicator Code', 'Unnamed: 68'], inplace=True)
# Setting the 'Country Code' column as the index
df.set_index('Country Code', inplace=True)
# Removing the index name for clarity
df.index.name = None
# Dropping rows where the country code is not present in 'country_codes_dict'
df.drop(df[~df.index.isin(country_codes_dict)].index, inplace=True)
# Transposing the DataFrame to have years as rows and country codes as columns
df = df.T
# Converting the year index to datetime format
df.index = pd.to_datetime(df.index.astype(int).astype(str))
# Returning the cleaned DataFrame
return df
# Cleaning all global datasets using the clean_df function
global_edu_expense_percent_gdp = clean_df(global_edu_expense_percent_gdp)
global_gdp_growth_percent = clean_df(global_gdp_growth_percent)
global_gdp = clean_df(global_gdp)
global_pop = clean_df(global_pop)
# Displaying the first three rows of the cleaned Global Educational Expenditure as a Percent of GDP DataFrame
print("\n\033[1mGlobal Educational Expenditure as a Percent of GDP\033[0m")
display(global_edu_expense_percent_gdp.head(3))
# Displaying the first three rows of the cleaned Global GDP Growth (annual %) DataFrame
print("\033[1mGlobal GDP Growth (annual %)\033[0m")
display(global_gdp_growth_percent.head(3))
# Displaying the first three rows of the cleaned Global GDP (USD) DataFrame
print("\033[1mGlobal GDP (USD)\033[0m")
display(global_gdp.head(3))
# Displaying the first three rows of the cleaned Global Population DataFrame
print("\033[1mGlobal Population\033[0m")
display(global_pop.head(3))
Global Educational Expenditure as a Percent of GDP
| ABW | AFG | AGO | ALB | AND | ARE | ARG | ARM | ASM | ATG | ... | VEN | VGB | VIR | VNM | VUT | WSM | YEM | ZAF | ZMB | ZWE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1961-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 215 columns
Global GDP Growth (annual %)
| ABW | AFG | AGO | ALB | AND | ARE | ARG | ARM | ASM | ATG | ... | VEN | VGB | VIR | VNM | VUT | WSM | YEM | ZAF | ZMB | ZWE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1961-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | 5.427843 | NaN | NaN | NaN | ... | 3.192519 | NaN | NaN | NaN | NaN | NaN | NaN | 3.844734 | 1.361382 | 6.316157 |
| 1962-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | -0.852022 | NaN | NaN | NaN | ... | 8.532934 | NaN | NaN | NaN | NaN | NaN | NaN | 6.177931 | -2.490839 | 1.434471 |
3 rows × 215 columns
Global GDP (USD)
| ABW | AFG | AGO | ALB | AND | ARE | ARG | ARM | ASM | ATG | ... | VEN | VGB | VIR | VNM | VUT | WSM | YEM | ZAF | ZMB | ZWE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 7.663938e+09 | NaN | NaN | NaN | NaN | NaN | NaN | 8.748597e+09 | 7.130000e+08 | 1.052990e+09 |
| 1961-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 8.067267e+09 | NaN | NaN | NaN | NaN | NaN | NaN | 9.225996e+09 | 6.962857e+08 | 1.096647e+09 |
| 1962-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 8.814310e+09 | NaN | NaN | NaN | NaN | NaN | NaN | 9.813996e+09 | 6.931429e+08 | 1.117602e+09 |
3 rows × 215 columns
Global Population
| ABW | AFG | AGO | ALB | AND | ARE | ARG | ARM | ASM | ATG | ... | VEN | VGB | VIR | VNM | VUT | WSM | YEM | ZAF | ZMB | ZWE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 54608.0 | 8622466.0 | 5357195.0 | 1608800.0 | 9443.0 | 133426.0 | 20349744.0 | 1904148.0 | 20085.0 | 55342.0 | ... | 8156937.0 | 7850.0 | 32500.0 | 32718461.0 | 64608.0 | 113335.0 | 5542459.0 | 16520441.0 | 3119430.0 | 3806310.0 |
| 1961-01-01 | 55811.0 | 8790140.0 | 5441333.0 | 1659800.0 | 10216.0 | 140984.0 | 20680653.0 | 1971530.0 | 20626.0 | 56245.0 | ... | 8453106.0 | 7885.0 | 34300.0 | 33621982.0 | 66462.0 | 116820.0 | 5646668.0 | 16989464.0 | 3219451.0 | 3925952.0 |
| 1962-01-01 | 56682.0 | 8969047.0 | 5521400.0 | 1711319.0 | 11014.0 | 148877.0 | 21020359.0 | 2039346.0 | 21272.0 | 57008.0 | ... | 8754082.0 | 7902.0 | 35000.0 | 34533889.0 | 68391.0 | 120163.0 | 5753386.0 | 17503133.0 | 3323427.0 | 4049778.0 |
3 rows × 215 columns
This code calculates Global GDP per Capita by dividing the GDP values by the corresponding population values, and then displays the first 5 rows of the result.
# Calculating Global GDP per Capita by dividing global GDP by global population
global_gdp_per_capita = global_gdp / global_pop
# Printing and displaying the first 5 rows of the calculated Global GDP per Capita
print("\n\033[1mGlobal GDP per Capita\033[0m")
display(global_gdp_per_capita.head(5))
Global GDP per Capita
| ABW | AFG | AGO | ALB | AND | ARE | ARG | ARM | ASM | ATG | ... | VEN | VGB | VIR | VNM | VUT | WSM | YEM | ZAF | ZMB | ZWE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 939.560806 | NaN | NaN | NaN | NaN | NaN | NaN | 529.561923 | 228.567399 | 276.643363 |
| 1961-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 954.355361 | NaN | NaN | NaN | NaN | NaN | NaN | 543.042224 | 216.274674 | 279.332656 |
| 1962-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1006.879977 | NaN | NaN | NaN | NaN | NaN | NaN | 560.699394 | 208.562685 | 275.966139 |
| 1963-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1060.570324 | NaN | NaN | NaN | NaN | NaN | NaN | 601.599951 | 209.453362 | 277.532515 |
| 1964-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 874.199411 | NaN | NaN | NaN | NaN | NaN | NaN | 642.688431 | 236.941713 | 282.376856 |
5 rows × 215 columns
This code reshapes, merges, and processes data to calculate relevant metrics (such as education expenditure and expenditure per capita) and cleans up the final dataset for further analysis.
# Reshaping the global population data into long format (year, country_code, population)
temp_pop = global_pop.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "population"})
# Reshaping the global GDP data into long format (year, country_code, GDP)
temp_gdp = global_gdp.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "GDP"})
# Reshaping the global GDP per capita data into long format (year, country_code, GDP_per_capita)
temp_gdp_per_capita = global_gdp_per_capita.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "GDP_per_capita"})
# Reshaping the global education expenditure as a percent of GDP data into long format (year, country_code, edu_expense_per_GDP)
temp_edu_expense_percent_gdp = global_edu_expense_percent_gdp.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "edu_expense_per_GDP"})
# Merging all datasets into one (year, country_code), adding population, GDP, GDP_per_capita, and edu_expense_per_GDP for each combination
q1_df = temp_pop.merge(temp_gdp, on=['year', 'country_code'])\
.merge(temp_gdp_per_capita, on=['year', 'country_code'])\
.merge(temp_edu_expense_percent_gdp, on=['year', 'country_code'])
# Calculating the total education expenditure based on the percentage of GDP
q1_df['edu_expense'] = q1_df['edu_expense_per_GDP'] * q1_df['GDP'] / 100
# Calculating the education expenditure per capita by dividing the total expenditure by the population
q1_df['edu_expense_per_capita'] = q1_df['edu_expense'] / q1_df['population']
# Mapping country codes to country names using a dictionary
q1_df['country_name'] = q1_df['country_code'].map(country_codes_dict)
# Creating a raw backup of the original DataFrame before dropping missing values
q1_df_raw = q1_df
# Dropping rows with missing values and resetting the index for the cleaned dataset
q1_df = q1_df.dropna().reset_index(drop=True)
# Display the cleaned dataset
q1_df
| year | country_code | population | GDP | GDP_per_capita | edu_expense_per_GDP | edu_expense | edu_expense_per_capita | country_name | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1998-01-01 | ABW | 84355.0 | 1.665363e+09 | 19742.316739 | 4.76316 | 7.932391e+07 | 940.358179 | Aruba |
| 1 | 1999-01-01 | ABW | 86867.0 | 1.722905e+09 | 19833.826746 | 4.39527 | 7.572633e+07 | 871.750211 | Aruba |
| 2 | 2000-01-01 | ABW | 89101.0 | 1.873453e+09 | 21026.167091 | 4.71536 | 8.834003e+07 | 991.459507 | Aruba |
| 3 | 2001-01-01 | ABW | 90691.0 | 1.896457e+09 | 20911.192767 | 4.79850 | 9.100149e+07 | 1003.423598 | Aruba |
| 4 | 2002-01-01 | ABW | 91781.0 | 1.961844e+09 | 21375.269123 | 4.87178 | 9.557670e+07 | 1041.356069 | Aruba |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4858 | 2014-01-01 | ZWE | 13855753.0 | 1.949552e+10 | 1407.034291 | 6.13835 | 1.196703e+09 | 86.368690 | Zimbabwe |
| 4859 | 2015-01-01 | ZWE | 14154937.0 | 1.996312e+10 | 1410.329173 | 5.81300 | 1.160456e+09 | 81.982438 | Zimbabwe |
| 4860 | 2016-01-01 | ZWE | 14452704.0 | 2.054868e+10 | 1421.787791 | 5.47300 | 1.124629e+09 | 77.814447 | Zimbabwe |
| 4861 | 2017-01-01 | ZWE | 14751101.0 | 1.758489e+10 | 1192.107012 | 5.81878 | 1.023226e+09 | 69.366084 | Zimbabwe |
| 4862 | 2018-01-01 | ZWE | 15052184.0 | 3.415607e+10 | 2269.177012 | 2.05049 | 7.003668e+08 | 46.529246 | Zimbabwe |
4863 rows × 9 columns
Let's look at a graph displaying the Singapore data relevant to this question
# Create a figure and the first subplot (ax1) with a specific figure size
fig, ax1 = plt.subplots(figsize=(10, 6))
# Plot Singapore education expenditure over time (in billion USD) on ax1 using a blue line
ax1.plot(sg_edu_expense.index, sg_edu_expense / 1e9, color='blue')
# Set the label for the y-axis on ax1, representing education expenditure, with the label in blue color
ax1.set_ylabel("Education Expenditure (Billion USD)", color='blue')
# Change the color of the y-axis tick labels on ax1 to blue
ax1.tick_params(axis='y', labelcolor='blue')
# Create a second y-axis (ax2) that shares the same x-axis as ax1 for plotting GDP
ax2 = ax1.twinx()
# Plot Singapore GDP over time (in billion USD) on ax2 using a green line
ax2.plot(sg_gdp.index, sg_gdp / 1e9, color='green')
# Set the label for the y-axis on ax2, representing GDP, with the label in green color
ax2.set_ylabel("Singapore GDP (Billion USD)", color='green')
# Change the color of the y-axis tick labels on ax2 to green
ax2.tick_params(axis='y', labelcolor='green')
# Display the final plot with both lines (education expenditure and GDP)
plt.title('SGP GDP and Edu Expenditure over Time')
plt.show()
From the graph, it is quite clear that there is a strong correlation between Singapore GDP and Education Expenditure. Now, lets take a look at some global data.
# Create a new figure object for the subplots
fig = plt.figure()
# Add a subplot (ax0) on the left side (1 row, 2 columns, 1st subplot)
ax0 = fig.add_subplot(1, 2, 1)
# Add a subplot (ax1) on the right side (1 row, 2 columns, 2nd subplot)
ax1 = fig.add_subplot(1, 2, 2)
# Create a box plot of the average global education expenditure (% of GDP) over the last 20 years
# Use the color blue and make the plot horizontal, then place it in the first subplot (ax0)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='box', color='blue', vert=False, figsize=(20, 8), ax=ax0)
# Set the title for the first subplot (ax0)
ax0.set_title('Box Plot of Global Education Expenditure (% of GDP)')
# Set the x-axis label for the first subplot (ax0)
ax0.set_xlabel('Education Expenditure (% of GDP)')
# Create a histogram of the average global education expenditure (% of GDP) over the last 20 years
# Place the histogram in the second subplot (ax1)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='hist', figsize=(20, 8), ax=ax1)
# Set the title for the second subplot (ax1)
ax1.set_title('Distribution of Global Education Expenditure (% of GDP)')
# Set the y-axis label for the second subplot (ax1)
ax1.set_ylabel('Number of Countries')
# Set the x-axis label for the second subplot (ax1)
ax1.set_xlabel('Education Expenditure (% of GDP)')
# Set a main title for the entire figure
fig.suptitle('Global Education Expenditure Analysis', fontsize=16)
# Display the figure with both subplots
plt.show()
From the graphs, it is quite clear that the vast majority of countries spend between 3-5% of their GDP on Education. However, note that this also means that most countries GDP are 20-30 times their Education Expenditure. Now, we will explore the variations between the countries, and figure out which countries spend less on their Education (as a percent of their GDP) and which countries spend more.
The below code shows an animated chloropleth graph which helps to visualise the Education Expenditure as a percent of GDP over countries and time.
# Fill forward missing values in the global_edu_expense_percent_gdp dataframe to handle missing data
map_data = global_edu_expense_percent_gdp.ffill().reset_index()
# Reshape the data using melt to make it long-form for plotting
# 'index' becomes 'year', and country ISO codes will be the 'iso_code'
map_data = map_data.melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
# Map the ISO country codes to their corresponding country names using a dictionary
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
# Convert the 'year' column to only include the year (dropping month/day if present)
map_data.year = map_data.year.dt.year
# Create an animated choropleth map using Plotly Express
px.choropleth(
map_data, # Input dataset
locations="iso_code", # The ISO codes to be used for location
color="value", # Column to represent in color scale (education expenditure as % of GDP)
hover_name="country_name", # The name to display when hovering over a country
animation_frame="year", # Column for creating the year-by-year animation
color_continuous_scale=px.colors.diverging.RdBu, # Color scale for the choropleth
height=600, # Height of the figure
range_color=[3, 7], # Set the range for the color scale (3% to 7% for education expenditure)
title="Edu Expenditure (% GDP) over Time"
)
From the graph (from 2005 - 2023), we can see that most developed countries, such as USA, Australia, Sweden, Finland, France all have high education expenditure (blue) whereas most underdeveloped countries, such as countries in Africa and Asia have low education expenditure (brown). We will explore this correlation in the Quantitative Analysis section.
The graphs below show a treemap of GDP, Population GDP per Capita in 2023
treemap_data = global_gdp.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "iso_code"})
treemap_data = treemap_data[treemap_data.year=='2023']
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
treemap_data,
path=['iso_code'], # Path defines the hierarchy (here we only use Country)
values='value', # The size of the rectangles will be based on GDP
color='value', # The color of the rectangles will be based on Population
hover_data=['country_name', 'value'], # Data to show on hover
color_continuous_scale='Plasma', # Diverging color scale
title="Treemap of GDP"
)
fig.show()
treemap_data = global_pop.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "iso_code"})
treemap_data = treemap_data[treemap_data.year=='2023']
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
treemap_data,
path=['iso_code'], # Path defines the hierarchy (here we only use Country)
values='value', # The size of the rectangles will be based on GDP
color='value', # The color of the rectangles will be based on Population
hover_data=['country_name', 'value'], # Data to show on hover
color_continuous_scale='Plasma', # Diverging color scale
title="Treemap of Population"
)
fig.show()
treemap_data = global_gdp_per_capita.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "iso_code"})
treemap_data = treemap_data[treemap_data.year=='2023']
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
treemap_data,
path=['iso_code'], # Path defines the hierarchy (here we only use Country)
values='value', # The size of the rectangles will be based on GDP
color='value', # The color of the rectangles will be based on Population
hover_data=['country_name', 'value'], # Data to show on hover
color_continuous_scale='Plasma', # Diverging color scale
title="Treemap of GDP Per Capita"
)
fig.show()
We now have the choice of either finding correlatiion between education expenditure on GDP or GDP per Capita. We will go with GDP per Capita since this would not be inflated by the country's population. For example, Singapore has a relatively low gross GDP even though it has a high education expenditure (as a percent of GDP). This is mainly because of its low population. Therefore, factoring out the population will allow us to determine the true impact of education expenditure on GDP.
Now, we will do some quantitative analysis between education expenditure and GDP
In this section, we will proceed with quantitative analysis. In other words, we will justify (quantitatively) some of the claims in the previous section. The DataFrame below is what we will be using throughout this section.
plt.figure(figsize=(8, 6))
sns.heatmap(q1_df.iloc[:, 2:-1].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Heatmap of Correlation Between Numerical Variables')
plt.show()
Right away, notice the strong positive correlations between Educational Expenditure and GDP as well as their per_capita counterparts. This clearly shows how Educational Expenditure is tied very strongly to GDP.
Now, put yourself in the government's shoes. What one really wants to know is how much of the GDP to allocate to Education, in order to reap future rewards in GDP (per capita). The relevant columns in the DataFrame above are edu_expense_per_GDP and GDP_per_capita. Although they show a correlation of 0.06, we will proceed to investigate this further. Note that this is very relevant to the overarching research question, What is the quantitative impact of government expenditure on education on GDP growth?
Firstly, let us now explore the impact of some countries with low population (less than 1 million)
q1_df[q1_df.population <= 1e6] \
.groupby(['country_code', 'country_name'])[['population', 'GDP_per_capita']] \
.mean() \
.sort_values('GDP_per_capita', ascending=False) \
.head(10)
| population | GDP_per_capita | ||
|---|---|---|---|
| country_code | country_name | ||
| MCO | Monaco | 33437.050000 | 136853.732749 |
| LIE | Liechtenstein | 34988.166667 | 123122.192944 |
| CYM | Cayman Islands (the) | 65509.000000 | 87539.048125 |
| BMU | Bermuda | 62325.333333 | 77884.442188 |
| LUX | Luxembourg | 456808.763158 | 63137.489784 |
| SMR | San Marino | 33483.909091 | 49320.200374 |
| ISL | Iceland | 310283.833333 | 47582.679899 |
| MAC | Macao | 547816.125000 | 47115.717361 |
| AND | Andorra | 74414.761905 | 41828.512092 |
| BHS | Bahamas (the) | 392210.428571 | 29106.289630 |
Clearly, countries with very low population should not be part of our analysis. This is because such countries serve a very different purpose as compared to the rest. For example, Monaco, The Cayman Islands and The Bahamas are not exactly famous for their education. Therefore, in the interest of more conventional nations, such as Singapore, China, India, USA, we will exclude such countries from our analysis.
Secondly, let us investigate the effect of nations with low GDP per capita (less that $5000 USD)
q1_df[q1_df.GDP_per_capita <= 5000] \
.groupby(['country_code', 'country_name'])[['GDP_per_capita']] \
.mean() \
.head(10)
| GDP_per_capita | ||
|---|---|---|
| country_code | country_name | |
| AFG | Afghanistan | 516.019290 |
| AGO | Angola | 2353.842443 |
| ALB | Albania | 2650.234690 |
| ARG | Argentina | 3426.489600 |
| ARM | Armenia | 2790.690035 |
| AUT | Austria | 3177.055201 |
| AZE | Azerbaijan | 2138.250456 |
| BDI | Burundi | 200.538067 |
| BEN | Benin | 954.686457 |
| BFA | Burkina Faso | 550.390259 |
Most of such countries are classified as developing or underdeveloped. In such countries, there are many other factors at play in determining the GDP, due to their inherent instability. Therefore, the percentage of GDP they devote to education will not have a strong influence on the GDP growth. Since this analysis is more geared towards developed nations, we will also exclude such countries from our analysis.
Now, lets view a scatterplot of GDP_per_capita vs edu_expense_per_GDP, after excluding countries with low population (< 1 million) and low GDP_per_capita (< $5000 USD)
q1_df_developed = q1_df[(q1_df['population'] >= 1000000) & (q1_df['GDP_per_capita'] >= 5000)]
fig = sns.regplot(data=q1_df_developed, x='edu_expense_per_GDP', y='GDP_per_capita')
plt.xlabel('Edu Expense (% GDP)')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita vs Edu Expense (% GDP)')
plt.show()
#calculate correlation
pearson_coef, p_value = stats.pearsonr(q1_df_developed['edu_expense_per_GDP'], q1_df_developed['GDP_per_capita'])
print("Pearson Coefficient: ", pearson_coef)
print("P-Value: ", p_value)
Pearson Coefficient: 0.21328518427710413 P-Value: 5.305722840202821e-19
We now get that Education Expenditure as a percent of GDP is weakly related to GDP_per_capita (with high certainty).
In the last decade, there have been many advances in technology, and a country's GDP is getting ever more closely related to its population's literacy. There has been growing demand from many new sectors, and new technologies have been replacing otherwise mundane tasks. Therefore, to wrap up our discussion for this question, let us delve into the correlation between Education Expenditure as a percent of GDP and GDP_per_capita during the last decade.
mean_edu_expense = pd.DataFrame(global_edu_expense_percent_gdp.iloc[-10:, :].mean(axis=0), columns=['Mean Education Expenditure (% of GDP)'])
mean_gdp_per_capita = pd.DataFrame(global_gdp_per_capita.iloc[-10:, :].mean(axis=0), columns=['Mean GDP Per Capita'])
mean_pop = pd.DataFrame(global_pop.iloc[-10:, :].mean(axis=0), columns=['Mean Population'])
merged_df = pd.merge(mean_edu_expense, mean_gdp_per_capita, left_index=True, right_index=True).merge(mean_pop, right_index=True, left_index=True)
def pearson_coef(min_pop, min_gdp_per_capita):
return merged_df[(merged_df['Mean Population'] >= min_pop) & (merged_df['Mean GDP Per Capita'] >= min_gdp_per_capita)].corr().iloc[0, 1]
pop_ranges = range(10000000, 1000000, -1000000)
gdp_ranges = range(5000, 10001, 1000)
correlation_matrix = pd.DataFrame(index=pop_ranges, columns=gdp_ranges)
for min_pop in pop_ranges:
for min_gdp in gdp_ranges:
correlation_matrix.loc[min_pop, min_gdp] = pearson_coef(min_pop, min_gdp)
correlation_matrix = correlation_matrix.apply(pd.to_numeric, errors='coerce')
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True,
xticklabels=[int(gdp / 1000) for gdp in gdp_ranges], yticklabels=[int(pop / 1e6) for pop in pop_ranges])
plt.title('Correlation between Education Expenditure (% GDP) and GDP per Capita')
plt.xlabel('Min GDP per Capita Threshold (Thousand USD)')
plt.ylabel('Min Population Threshold (Millions)')
plt.show()
From the graph, we can see that as we increase the Min Population Threshold and the Min GDP per capita Threshold, the correlation between Education Expenditure as a percent of GDP and GDP_per_capita gets stronger and stronger. This finally brings us to the conclusion to this question.
- A country's GDP has a very strong correlation with its expenditure on education
- The percentage of GDP a country decides to invest in education has a correlation with its GDP per capita. As a country gets bigger and more developed, this correlation gets stronger.
In this section, we will aim to answer the question, To what extent does government expenditure on education reduce income inequality, as measured by the Gini coefficient?
Firstly, an overview of the indicators:
Gini Index (SI.POV.GINI) Measures income inequality on a scale of 0 to 100. A value of 0 indicates perfect equality, while 100 represents perfect inequality.
Income Share Held by Highest 10% (SI.DST.10TH.10) Percentage of total income earned by the top 10% of the population. A higher percentage indicates greater income concentration and inequality.
Income Share Held by Highest 20% (SI.DST.05TH.20) Percentage of total income held by the top 20%. Similar to the previous indicator, it reflects overall income distribution and inequality.
Income Share Held by Lowest 10% (SI.DST.FRST.10) Percentage of total income received by the lowest 10%. A lower value indicates poorer income distribution and highlights poverty issues.
Income Share Held by Lowest 20% (SI.DST.FRST.20) Measures income share of the bottom 20%. Like the lowest 10%, it signals income inequality and the economic conditions of disadvantaged groups.
In Exploratory Analysis, we will be examining these indicators in turn, and what they reveal about the overarching question. We will be examining the data via plots and graphs.
In Quantitative Analysis, we will be tacking the question head-on, trying to unearth the underlying reasons behind our suspicions.
We will be using the following datasets for this question
https://data.worldbank.org/indicator/SI.POV.GINI
- (Global), Gini Index
https://data.worldbank.org/indicator/SI.DST.10TH.10
- (Global), Income share held by highest 10%
https://data.worldbank.org/indicator/SI.DST.05TH.20
- (Global), Income share held by highest 20%
https://data.worldbank.org/indicator/SI.DST.FRST.10
- (Global), Income share held by lowest 10%
https://data.worldbank.org/indicator/SI.DST.FRST.20
- (Global), Income share held by lowest 20%
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
- (Global), GDP (Current US$)
https://data.worldbank.org/indicator/SP.POP.TOTL?name_desc=false
- (Global), Total Population
https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?_gl=1
- (Global), Government expenditure on education, total (% of GDP)
gini = clean_df(df16.copy())
top_10 = clean_df(df17.copy())
top_20 = clean_df(df18.copy())
bottom_20 = clean_df(df19.copy())
bottom_10 = clean_df(df20.copy())
edu_expense = global_edu_expense_percent_gdp * global_gdp / 100
edu_expense_per_capita = edu_expense / global_pop
Firstly, lets look at some mean income inequality indicators over that past decade
mean_gini = gini.iloc[-10:, :].mean()
mean_top_10 = top_10.iloc[-10:, :].mean()
mean_top_20 = top_20.iloc[-10:, :].mean()
mean_bottom_20 = bottom_20.iloc[-10:, :].mean()
mean_bottom_10 = bottom_10.iloc[-10:, :].mean()
merged_mean = pd.DataFrame(data={"Gini Index": mean_gini, "Top 10%": mean_top_10, "Top 20%": mean_top_20, "Bottom 10%": mean_bottom_10, "Bottom 20%": mean_bottom_20})
fig = sns.violinplot(merged_mean)
plt.ylabel('Number of Countries')
plt.title("Boxplots of Income Inequality Indicators")
plt.show()
The graph above shows how there is a clear disparity between the Income groups. What we will investigate in the following sections is how Education Expenditure correlates with this disparity
map_data = gini.ffill().reset_index().melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
map_data = map_data[map_data.year.dt.year == 2023]
fig = px.choropleth(map_data,
locations="iso_code",
color="value",
hover_name="country_name",
color_continuous_scale=px.colors.diverging.RdBu_r,
height=600,
range_color=[20, 60]
)
fig.update_layout(
title_text='Gini Index in 2023', # Set your title here
title_x=0.5 # Center the title (optional)
)
fig.show()
map_data = top_10.ffill().reset_index().melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
map_data = map_data[map_data.year.dt.year == 2023]
fig = px.choropleth(map_data,
locations="iso_code",
color="value",
hover_name="country_name",
color_continuous_scale=px.colors.diverging.RdBu_r,
height=600,
range_color=[10, 50]
)
fig.update_layout(
title_text='Top 10% Share in 2023', # Set your title here
title_x=0.5 # Center the title (optional)
)
fig.show()
The graph above depicts how developing countries tend to have higher Gini coefficients and Top 10% Share than developed countries. This means income inequality is often more pronounced in developing countries.
However, it’s important to note that there are exceptions. Some developed countries, like the United States (0.41), have relatively high Gini coefficients compared to other developed nations.
Now, lets turn our attention to the predictor variable, Educational Expenditure per Capita. As in Question 1, we will use Educational Expenditure per Capita instead of Educational Expenditure to fairly assess the impact of investing in education on Income Inequality
fig = plt.figure()
ax0 = fig.add_subplot(1, 2, 1)
ax1 = fig.add_subplot(1, 2, 2)
# Subplot 1: Box plot
edu_expense_per_capita.iloc[-20:, :].mean().plot(kind='box', color='blue', vert=False, figsize=(20, 8), ax=ax0) # add to subplot 1
ax0.set_title('Box Plot of Global Education Expenditure per Capita')
ax0.set_xlabel('Education Expenditure per Capita')
# Subplot 2: Histogram plot
edu_expense_per_capita.iloc[-20:, :].mean().plot(kind='hist', figsize=(20, 8), ax=ax1) # add to subplot 2
ax1.set_title('Distribution of Global Education Expenditure per Capita')
ax1.set_ylabel('Frequency')
ax1.set_xlabel('Education Expenditure per Capita')
fig.suptitle('Global Education Expenditure (per Capita) Analysis', fontsize=16)
plt.show()
Clearly, Educational Expenditure per Capita is heavily skewed to the right. We will now proceed to investigate this further.
The graph below depicts a treemap of Mean Education Expenditure per Capita over the Last Decade. This will allow us to glimpse the countries with high/low Mean Education Expenditure per Capita over the Last Decade.
treemap_data = edu_expense_per_capita.iloc[-10:, :].mean().reset_index().rename(columns={"index": "iso_code", 0:'value'})
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
treemap_data,
path=['iso_code'], # Path defines the hierarchy (here we only use Country)
values='value', # The size of the rectangles will be based on GDP
color='value', # The color of the rectangles will be based on Population
hover_data=['country_name', 'value'], # Data to show on hover
color_continuous_scale='Plasma', # Diverging color scale
title="Mean Education Expenditure per Capita over the Last Decade"
)
fig.show()
From the graph above, we can begin to suspect that countries that have low population tend to have high Mean Education Expenditure per Capita and countries that are underdeveloped tend to have very low Mean Education Expenditure per Capita.
To confirm this, let us look at a Chloropleth graph displaying Education Expenditure per Capita in 2023
map_data = edu_expense_per_capita.ffill().reset_index().melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
map_data = map_data[map_data.year.dt.year == 2023]
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
fig = px.choropleth(map_data,
locations="iso_code",
color="value",
hover_name="country_name",
color_continuous_scale=px.colors.diverging.RdBu,
height=600,
range_color=[0, 2000]
)
fig.update_layout(
title_text='Education Expenditure per Capita in 2023', # Set your title here
title_x=0.5 # Center the title (optional)
)
fig.show()
From the graph above, we can see that North America and most of Europe tend to have high Education Expenditure per Capita. On the other hand, most of Africa has very low Education Expenditure per Capita. Therefore, the graph above confirms that the development of a country has a strong correlation with its Education Expenditure per Capita. However, one thing to note is that the difference in Education Expenditure per Capita between highly developed and underdeveloped countries is quite large (80-90 times).
Finally, the graph below depicts a plot of Gini Index over time, for various countries. Additionally, the countries have been coloured such that the ones with low educational expenditure are colored red while the ones with high educational expenditure are colored blue
plt.figure(figsize=(9, 7))
inv_color_dict = edu_expense_per_capita.iloc[-10:, :].mean().dropna().sort_values(ascending=False).reset_index().drop(0, axis=1).to_dict()['index']
keys = np.array(list(inv_color_dict.keys()))
values = np.array(list(inv_color_dict.values()))
color_dict = dict(zip(values, keys))
# Get a colormap
cmap = plt.get_cmap('coolwarm')
norm = plt.Normalize(0, len(color_dict))
# Plot each country's Gini index as a scatter plot, only if the country exists in color_dict
for country in gini.columns:
if country in color_dict: # Check if the country is in the color_dict
color_value = norm(color_dict[country]) # Normalize the country index
sns.scatterplot(x=gini.index.year, y=gini[country], color=cmap(color_value), label=country)
# Add a legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', ncol=6)
plt.ylabel("Gini Index")
plt.xlabel("Time")
plt.show()
The graph depicts a transition from red or grey to blue as we move downwards. This hints at a positive correlation between Educational Expenditure and Gini Index, something we will investigate in more detail in the Quantitative Analysis section
gini_melted = gini.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'gini_index'})
top_10_melted = top_10.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'top_10'})
top_20_melted = top_20.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'top_20'})
bottom_20_melted = bottom_20.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'bottom_20'})
bottom_10_melted = bottom_10.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'bottom_10'})
edu_expense_melted = edu_expense.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'edu_expense'})
edu_expense_per_capita_melted = edu_expense_per_capita.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'edu_expense_per_capita'})
q2_df = pd.merge(gini_melted, top_10_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, top_20_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, bottom_20_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, bottom_10_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, edu_expense_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, edu_expense_per_capita_melted, on=['iso_code', 'year'], how='outer')
q2_df_raw = q2_df
q2_df = q2_df.dropna().reset_index(drop=True)
q2_df
| year | iso_code | gini_index | top_10 | top_20 | bottom_20 | bottom_10 | edu_expense | edu_expense_per_capita | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | AGO | 51.9 | 40.2 | 56.1 | 3.2 | 1.0 | 2.380569e+08 | 14.520925 |
| 1 | 2018-01-01 | AGO | 51.3 | 39.6 | 55.6 | 3.8 | 1.3 | 1.624767e+09 | 51.953409 |
| 2 | 1996-01-01 | ALB | 27.0 | 20.7 | 36.0 | 9.2 | 3.9 | 9.866124e+07 | 31.142744 |
| 3 | 2002-01-01 | ALB | 31.7 | 25.3 | 40.3 | 8.4 | 3.5 | 1.355641e+08 | 44.432523 |
| 4 | 2005-01-01 | ALB | 30.6 | 24.3 | 39.2 | 8.4 | 3.5 | 2.642329e+08 | 87.741682 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1613 | 2010-01-01 | ZMB | 52.0 | 41.0 | 57.5 | 4.1 | 1.6 | 6.922715e+08 | 50.193386 |
| 1614 | 2015-01-01 | ZMB | 55.8 | 43.4 | 60.2 | 3.1 | 1.2 | 9.827264e+08 | 60.482059 |
| 1615 | 2022-01-01 | ZMB | 51.5 | 39.1 | 56.4 | 3.9 | 1.5 | 1.044938e+09 | 52.200782 |
| 1616 | 2011-01-01 | ZWE | 43.2 | 33.8 | 49.7 | 5.8 | 2.5 | 7.160955e+08 | 54.975230 |
| 1617 | 2017-01-01 | ZWE | 44.3 | 34.8 | 51.1 | 6.0 | 2.5 | 1.023226e+09 | 69.366084 |
1618 rows × 9 columns
The figure below depicts a heatmap of the Pearson Correlation Coefficients between variables in our analysis
plt.figure(figsize=(8, 6))
sns.heatmap(q2_df.iloc[:, 2:].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Heatmap of Correlation Between Numerical Variables')
plt.show()
Notice how edu_expense_per_capita has a somewhat strong negative correlation with gini_index (along with top_10 and top_20). Although this is definitely enough to claim that Educational Expenditure (per Capita) plays a role in reducing income inequality, let us delve into this deeper.
The graph below depicts a scatterplot of GNI Index against Educational Expenditure (per Capita). The graph also shows the corresponding linear regression line in red.
fig = sns.regplot(data=q2_df, x='edu_expense_per_capita', y='gini_index', marker='.', line_kws={"color": "#FF6666"})
plt.ylabel("Gini Index")
plt.xlabel("Edu Expenditure per Capita")
plt.title("Gini Index vs Educational Expenditure")
plt.show()
Note the quartile values of Gini Index, listed below.
q2_df.describe()[['gini_index']]
| gini_index | |
|---|---|
| count | 1618.000000 |
| mean | 36.674845 |
| min | 20.700000 |
| 25% | 30.400000 |
| 50% | 34.800000 |
| 75% | 41.575000 |
| max | 64.800000 |
| std | 8.357212 |
There are now 2 things to note from the scatterplot. Firstly, note how all the countries with Gini Index greater that the 3rd Quartile have very low Educational Expenditure per Capita. Secondly, we notice that these values deviate significantly from our regression line. We will now proceed to investigate.
The code below analyses the countries with Gini Index greater that the 3rd Quartile, and Educational Expenditure per Capita less that $800 USD.
outliers = q2_df[(q2_df.gini_index > q2_df.gini_index.quantile(0.75)) & (q2_df.edu_expense_per_capita < 800)].groupby(['iso_code'])[['gini_index', 'edu_expense_per_capita']].mean()
outliers['GDP_per_capita'] = outliers.index.map(global_gdp_per_capita.mean(axis=0).to_dict())
outliers['GDP_per_capita'].plot(kind='hist')
plt.title('Distribution of GDP per Capita among Outliers')
plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Number of Countries')
plt.show()
We see that most of such outliers are underdeveloped countries, with very low GDP per Capita. Such underdeveloped countries are subject to high volatility. Additionally, many other factors are at play in determining their Gini Index, such as government corruption, resources. Consequently, such countries may deviate quite significantly from our regression line.
- A country's Educational Expenditure per Capita has a strong negative correlation on its Gini Index
- All countries with Gini Index greater than the 75th percentile have very low Educational Expenditure per Capita
In this section, we aim to address the question, "What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economic well-being of its population?". To achieve this, we will forecast two key variables: GDP Growth (Annual %) and Gini Index.
The (Multiple Linear Regression Model) MLM for GDP Growth (Annual %) will incorporate Population, Educational Expenditure as a percent of GDP, Educational Expenditure per Capita, GDP per Capita as predictors. By providing inputs for Educational Expenditure as a percent of GDP, we will explore how it influences Singapore’s future GDP. Furthermore, based on our findings from Question 1, we will exclude countries with very low Population or GDP per Capita from the training data to enhance the robustness of the model.
The (Linear Regression Model) LM for Gini Index will focus on predicting income inequality, using Educational Expenditure per Capita as a key predictor, in line with the insights derived from our analysis in Question 2.
Firstly, lets combine the datasets from Questions 1 and 2 so that we have all the variable necessary to create our model in a single DataFrame. Note that we are combining the raw datasets, as we cannot afford to lose all the rows with missing values while training our model.
q3_df_raw = q1_df_raw.rename(columns={"country_code": "iso_code"}).drop(['edu_expense', 'edu_expense_per_capita'], axis=1).merge(q2_df_raw, on=['iso_code', 'year'], how='outer')
q3_df_raw
| year | iso_code | population | GDP | GDP_per_capita | edu_expense_per_GDP | country_name | gini_index | top_10 | top_20 | bottom_20 | bottom_10 | edu_expense | edu_expense_per_capita | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1960-01-01 | ABW | 54608.0 | NaN | NaN | NaN | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1961-01-01 | ABW | 55811.0 | NaN | NaN | NaN | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1962-01-01 | ABW | 56682.0 | NaN | NaN | NaN | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1963-01-01 | ABW | 57475.0 | NaN | NaN | NaN | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1964-01-01 | ABW | 58178.0 | NaN | NaN | NaN | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13755 | 2019-01-01 | ZWE | 15354608.0 | 2.183223e+10 | 1421.868596 | NaN | Zimbabwe | 50.3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 13756 | 2020-01-01 | ZWE | 15669666.0 | 2.150970e+10 | 1372.696674 | NaN | Zimbabwe | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13757 | 2021-01-01 | ZWE | 15993524.0 | 2.837124e+10 | 1773.920411 | NaN | Zimbabwe | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13758 | 2022-01-01 | ZWE | 16320537.0 | 2.736663e+10 | 1676.821489 | NaN | Zimbabwe | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13759 | 2023-01-01 | ZWE | 16665409.0 | 2.653827e+10 | 1592.416574 | NaN | Zimbabwe | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13760 rows × 14 columns
print("\n\033[1mOriginal dataset\033[0m\n")
q3_df_raw.info()
print("\n\033[1mDataset after dropping rows with null values\033[0m\n")
q3_df_raw.dropna().info()
Original dataset <class 'pandas.core.frame.DataFrame'> RangeIndex: 13760 entries, 0 to 13759 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 13760 non-null datetime64[ns] 1 iso_code 13760 non-null object 2 population 13730 non-null float64 3 GDP 11113 non-null float64 4 GDP_per_capita 11113 non-null float64 5 edu_expense_per_GDP 4927 non-null float64 6 country_name 13760 non-null object 7 gini_index 2099 non-null float64 8 top_10 2098 non-null float64 9 top_20 2098 non-null float64 10 bottom_20 2098 non-null float64 11 bottom_10 2098 non-null float64 12 edu_expense 4863 non-null float64 13 edu_expense_per_capita 4863 non-null float64 dtypes: datetime64[ns](1), float64(11), object(2) memory usage: 1.5+ MB Dataset after dropping rows with null values <class 'pandas.core.frame.DataFrame'> Index: 1618 entries, 168 to 13753 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 1618 non-null datetime64[ns] 1 iso_code 1618 non-null object 2 population 1618 non-null float64 3 GDP 1618 non-null float64 4 GDP_per_capita 1618 non-null float64 5 edu_expense_per_GDP 1618 non-null float64 6 country_name 1618 non-null object 7 gini_index 1618 non-null float64 8 top_10 1618 non-null float64 9 top_20 1618 non-null float64 10 bottom_20 1618 non-null float64 11 bottom_10 1618 non-null float64 12 edu_expense 1618 non-null float64 13 edu_expense_per_capita 1618 non-null float64 dtypes: datetime64[ns](1), float64(11), object(2) memory usage: 189.6+ KB
Clearly, there are a lot of missing values. Additionally, we cannot afford to drop all the missing rows as that results in a huge loss of available data. However, since these are socio-economic indicators, they are very unlikely to vary much within a decade. Therefore, we will employ Forward Fill to fill in these missing values, till a limit of 10 entries. Afterwards, we will melt and merge these new datasets to get the dataset we need to train our model.
gini_filled = gini.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gini_index'})
global_pop_filled = global_pop.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'population'})
global_gdp_filled = global_gdp.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gdp'})
global_gdp_per_capita_filled = global_gdp_per_capita.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gdp_per_capita'})
global_edu_expense_percent_gdp_filled = global_edu_expense_percent_gdp.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'edu_expense_percent_gdp'})
top_10_filled = top_10.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'top_10_income_share'})
top_20_filled = top_20.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'top_20_income_share'})
bottom_20_filled = bottom_20.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'bottom_20_income_share'})
bottom_10_filled = bottom_10.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'bottom_10_income_share'})
edu_expense_filled = edu_expense.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'edu_expense'})
edu_expense_per_capita_filled = edu_expense_per_capita.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'edu_expense_per_capita'})
global_gdp_growth_percent_filled = global_gdp_growth_percent.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gdp_growth'})
The code below prepares the data for the GDP per Capita MLM. We drop the values with null entries after our forward-fill procedure, along with filtering out the countries with low Population or GDP per Capita
gdp_per_capita_mlm_data = global_pop_filled.merge(global_edu_expense_percent_gdp_filled, on=['iso_code', 'year'], how='outer') \
.merge(global_gdp_per_capita_filled, on=['iso_code', 'year'], how='outer') \
.merge(global_gdp_growth_percent_filled, on=['iso_code', 'year'], how='outer') \
.merge(edu_expense_per_capita_filled, on=['iso_code', 'year'], how='outer')
gdp_per_capita_mlm_data.dropna(inplace=True)
gdp_per_capita_mlm_data = gdp_per_capita_mlm_data[(gdp_per_capita_mlm_data.population >= 5e6) & (gdp_per_capita_mlm_data.gdp_per_capita >= 5e3)].reset_index(drop=True)
gdp_per_capita_mlm_data
| year | iso_code | population | edu_expense_percent_gdp | gdp_per_capita | gdp_growth | edu_expense_per_capita | |
|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | AGO | 25188292.0 | 3.08200 | 5083.826873 | 8.542107 | 156.683545 |
| 1 | 2013-01-01 | AGO | 26147002.0 | 3.87000 | 5061.349253 | 4.954613 | 195.874210 |
| 2 | 2014-01-01 | AGO | 27128337.0 | 2.93000 | 5011.984427 | 4.822559 | 146.851147 |
| 3 | 2019-01-01 | ARE | 9211657.0 | 3.86021 | 45376.170839 | 1.108348 | 1751.615458 |
| 4 | 2020-01-01 | ARE | 9287289.0 | 3.98418 | 37629.174168 | -4.957052 | 1499.214021 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1527 | 2019-01-01 | ZAF | 58087055.0 | 5.92770 | 6702.526617 | 0.259936 | 397.305673 |
| 1528 | 2020-01-01 | ZAF | 58801927.0 | 6.18336 | 5753.066494 | -5.963358 | 355.732818 |
| 1529 | 2021-01-01 | ZAF | 59392255.0 | 6.56206 | 7073.612754 | 4.703062 | 464.174705 |
| 1530 | 2022-01-01 | ZAF | 59893885.0 | 6.18348 | 6766.481254 | 1.910406 | 418.404001 |
| 1531 | 2023-01-01 | ZAF | 60414495.0 | 6.60400 | 6253.161613 | 0.601662 | 412.958799 |
1532 rows × 7 columns
The code below prepares the data for the Gini Index LM. We drop the values with null entries after our forward-fill procedure.
gini_index_mlm_data = edu_expense_per_capita_filled.merge(gini_filled, on=['iso_code', 'year'], how='outer')
gini_index_mlm_data = gini_index_mlm_data.dropna().reset_index(drop=True)
gini_index_mlm_data
| year | iso_code | edu_expense_per_capita | gini_index | |
|---|---|---|---|---|
| 0 | 2000-01-01 | AGO | 14.520925 | 51.9 |
| 1 | 2001-01-01 | AGO | 14.520925 | 51.9 |
| 2 | 2002-01-01 | AGO | 14.520925 | 51.9 |
| 3 | 2003-01-01 | AGO | 14.520925 | 51.9 |
| 4 | 2004-01-01 | AGO | 14.520925 | 51.9 |
| ... | ... | ... | ... | ... |
| 4203 | 2019-01-01 | ZWE | 46.529246 | 50.3 |
| 4204 | 2020-01-01 | ZWE | 46.529246 | 50.3 |
| 4205 | 2021-01-01 | ZWE | 46.529246 | 50.3 |
| 4206 | 2022-01-01 | ZWE | 46.529246 | 50.3 |
| 4207 | 2023-01-01 | ZWE | 46.529246 | 50.3 |
4208 rows × 4 columns
The cell below creates a Multiple Linear Regression Model (MLM) for predicting GDP Growth based on Population, Educational Expenditure (percent GDP), Educational Expenditure (per Capita) and GDP per Capita
gdp_per_capita_mlm = LinearRegression()
X_train_gdp, X_test_gdp, y_train_gdp, y_test_gdp = train_test_split(gdp_per_capita_mlm_data[['population', 'edu_expense_percent_gdp', 'gdp_per_capita', 'edu_expense_per_capita']], gdp_per_capita_mlm_data['gdp_growth'], test_size=0.2, random_state=0)
gdp_per_capita_mlm.fit(X_train_gdp, y_train_gdp)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
coefficients = gdp_per_capita_mlm.coef_
intercept = gdp_per_capita_mlm.intercept_
# Print the equation for GDP growth
equation = "gdp_growth = {:.2e} * population + {:.2e} * edu_expense_percent_gdp + {:.2e} * gdp_per_capita + {:.2e} * edu_expense_per_capita + {:.4f}".format(
coefficients[0], coefficients[1], coefficients[2], coefficients[3], intercept)
print("The equation for GDP growth is:")
print(equation)
The equation for GDP growth is: gdp_growth = 1.94e-09 * population + -4.21e-01 * edu_expense_percent_gdp + -2.45e-05 * gdp_per_capita + 2.03e-04 * edu_expense_per_capita + 4.8444
Below is a scatterplot displaying predicted values of GDP Growth against actual values of GDP Growth
y_hat_gdp = gdp_per_capita_mlm.predict(X_test_gdp)
plt.scatter(y_test_gdp, y_hat_gdp, marker='.')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs Actual')
plt.show()
From the scatterplot, we can see that there are quite a few outliers, with GDP Growth values below -5 or above 10. Such values are quite definitely outliers. Although GDP Growth is definitely subject to much volatility, as a result of wars, natural disasters, etc., we definitely cannot take these values into account in our model.
The cell below display the Mean Squared Error (MSE) of this model. Let us now proceed to remove the outliers.
print("MSE: ", mean_squared_error(y_test_gdp, y_hat_gdp))
MSE: 11.769396860956444
gdp_per_capita_mlm_data = gdp_per_capita_mlm_data[abs(gdp_per_capita_mlm_data.gdp_growth) <= 5]
gdp_per_capita_mlm = LinearRegression()
X_train_gdp, X_test_gdp, y_train_gdp, y_test_gdp = train_test_split(gdp_per_capita_mlm_data[['population', 'edu_expense_percent_gdp', 'gdp_per_capita', 'edu_expense_per_capita']], gdp_per_capita_mlm_data['gdp_growth'], test_size=0.2, random_state=0)
gdp_per_capita_mlm.fit(X_train_gdp, y_train_gdp)
y_hat_gdp = gdp_per_capita_mlm.predict(X_test_gdp)
plt.scatter(y_test_gdp, y_hat_gdp, marker='.')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
From the above graph, we can see that our model has improved considerably. Additionally, the MSE (displayed below) has reduced. Therefore, let us proceed with the latter model.
print("MSE: ", mean_squared_error(y_test_gdp, y_hat_gdp))
MSE: 5.4142837248373885
Let us take a look at the Residual Plots for our latest model.
residuals = y_test_gdp - y_hat_gdp
plt.figure(figsize=(8,6))
sns.scatterplot(x=y_hat_gdp, y=residuals)
plt.axhline(y=0, color='r', linestyle='--') # Add a horizontal line at y=0
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs = axs.flatten()
y_pred_gdp = gdp_per_capita_mlm.predict(X_test_gdp)
residuals_gdp = y_test_gdp - y_pred_gdp
for i, ax in enumerate(axs):
if i < X_test_gdp.shape[1]:
ax.scatter(X_test_gdp.iloc[:, i], residuals_gdp, color='blue', alpha=0.6)
ax.axhline(y=0, color='r', linestyle='--')
ax.set_xlabel(X_test_gdp.columns[i])
ax.set_ylabel('Residuals')
ax.set_title(f'Residuals vs. {X_test_gdp.columns[i]}')
plt.tight_layout()
plt.show()
All of the residual plots above are fairly evenly distributed. However, take note of the y-axis. the negative residuals are more sparse but have higher deviation, while the positive residuals are more common, but have lower deviation.
The cell below initiates a Linear Regression Model (LM) to predict the Gini Index of a country based on its Educational Expenditure per Capita.
gini_lm = LinearRegression()
X_train_gini, X_test_gini, y_train_gini, y_test_gini = train_test_split(gini_index_mlm_data[['edu_expense_per_capita']], gini_index_mlm_data['gini_index'], test_size=0.2, random_state=0)
gini_lm.fit(X_train_gini, y_train_gini)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
gini_coefficient = gini_lm.coef_[0]
gini_intercept = gini_lm.intercept_
gini_equation = "gini_index = {:.2e} * edu_expense_per_capita + {:.2f}".format(gini_coefficient, gini_intercept)
print("The equation for predicting Gini index is:")
print(gini_equation)
The equation for predicting Gini index is: gini_index = -3.42e-03 * edu_expense_per_capita + 40.67
The equation above depicts how (according to the LM) a higher Educational Expenditure per Capita gradually decreases the Gini Index. The graph below shows a scatterplot of predicted Gini Index values vs actual ones
y_hat_gini = gini_lm.predict(X_test_gini)
plt.scatter(y_test_gini, y_hat_gini, marker='.')
plt.plot([20, 40], [20, 40], color="red")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
print("Mean Squared Error: ", mean_squared_error(y_hat_gini, y_test_gini))
print("R-squared value: ", gini_lm.score(X_train_gini, y_train_gini))
Mean Squared Error: 69.04775961971771 R-squared value: 0.14890953856695732
The scatterplot above depicts how most of the deviation from our model occurs when the actual Gini Index is quite high (above the 3rd Quartile). However, as we had discussed in Question 2, only countries that are underdeveloped have Gini Index above the 3rd Quartile. Therefore, let us remove these countries from the dataset and recreate our model.
gini_index_mlm_data = gini_index_mlm_data[gini_index_mlm_data['gini_index'] <= gini_index_mlm_data['gini_index'].quantile(0.75)]
gini_lm = LinearRegression()
X_train_gini, X_test_gini, y_train_gini, y_test_gini = train_test_split(gini_index_mlm_data[['edu_expense_per_capita']], gini_index_mlm_data['gini_index'], test_size=0.2, random_state=0)
gini_lm.fit(X_train_gini, y_train_gini)
y_hat_gini = gini_lm.predict(X_test_gini)
plt.scatter(y_test_gini, y_hat_gini, marker='.')
plt.plot([22, 36], [22, 36], color='red')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
print("Mean Squared Error: ", mean_squared_error(y_hat_gini, y_test_gini))
print("R-squared value: ", gini_lm.score(X_train_gini, y_train_gini))
Mean Squared Error: 25.653538915221688 R-squared value: 0.16104166357384508
Clearly, this improves our model considerably. Therefore, let us proceed with this improved model.
The below displays a residual plot
sns.residplot(x=gini_index_mlm_data['edu_expense_per_capita'], y=gini_index_mlm_data['gini_index'])
plt.xlabel('Educational Expenditure per Capita')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
Note that when Educational Expenditure per Capita increases, the residuals tend to be positive. Therefore, we can infer that developed countries will probably have lower Gini Index than the one that we predict.
In this section, we will apply the models previously developed and try to forecast Singapore's Economic Welfare based on different Educational Expenditure policies. We will do this by forecasting 2 key variables, GDP and Gini Index, for the next 50 years.
We will first predict GDP using our MLM, and then use the predicted Educational Expenditure per Capita to predict the Gini Index using our LM.
sgp_data = pd.merge(global_gdp_per_capita.loc[:, 'SGP'].rename('gdp_per_capita'), global_pop.loc[:, 'SGP'].rename('population'), left_index=True, right_index=True)
sg_edu_expense_per_capita = sg_edu_expense['Expenditure On Education'] / global_pop.loc[:, 'SGP']
sgp_data = pd.merge(sgp_data, sg_edu_expense_per_capita.rename("edu_expense_per_capita"), left_index=True, right_index=True)
sg_edu_expense_per_gdp = sg_edu_expense['Expenditure On Education'] / sg_gdp['GDP'] * 100
sgp_data = pd.merge(sgp_data, sg_edu_expense_per_gdp.rename("edu_expense_percent_gdp"), left_index=True, right_index=True)
sgp_data = sgp_data[['population', 'edu_expense_percent_gdp', 'gdp_per_capita', 'edu_expense_per_capita']]
sgp_data
| population | edu_expense_percent_gdp | gdp_per_capita | edu_expense_per_capita | |
|---|---|---|---|---|
| 1960-01-01 | 1646400.0 | NaN | 428.056183 | 29.090343 |
| 1961-01-01 | 1702400.0 | NaN | 449.148137 | 34.160280 |
| 1962-01-01 | 1750200.0 | NaN | 472.082740 | 42.244875 |
| 1963-01-01 | 1795000.0 | NaN | 511.202235 | 46.881694 |
| 1964-01-01 | 1841600.0 | NaN | 485.530686 | 50.922361 |
| ... | ... | ... | ... | ... |
| 2019-01-01 | 5703569.0 | 2.463315 | 66081.719924 | 1739.990395 |
| 2020-01-01 | 5685807.0 | 2.523389 | 61466.803676 | 1681.851906 |
| 2021-01-01 | 5453566.0 | 2.330660 | 79601.412962 | 1846.448405 |
| 2022-01-01 | 5637022.0 | 1.942417 | 88428.702423 | 1807.197737 |
| 2023-01-01 | 5917648.0 | NaN | 84734.255921 | NaN |
64 rows × 4 columns
The graph below depicts a Small Multiples plot of the data above.
nrow=2
ncol=2
fig, axes = plt.subplots(nrow, ncol,figsize=(16,12))
titles = ['Population', 'Edu Expense (% GDP)', 'GDP per Capita', 'Edu Expense per Capita']
# plot counter
count=0
for r in range(nrow):
for c in range(ncol):
sgp_data[sgp_data.columns[count]].plot(kind='line', ax=axes[r,c])
axes[r,c].set_title(titles[count]) #give every subplot a title
count+=1
plt.suptitle("Trends of Various Indicators in SGP over Iime") #You can add a title to the entire figure with the suptitle() function
plt.tight_layout()
plt.show()
All of the graphs above show a very clear (and similar) upwards trend, with the exception of Educational Expenditure (as a percent of GDP). Note that Singapore's Educational Expenditure (as a percent of GDP) is relatively low compared to other countries, based on our findings in Question 1. Therefore, we will adjust this value and forecast Singapore's future GDP using our MLM.
Below is a graph of Singapore's population growth year on year.
fig = (sgp_data['population'].pct_change() * 100).plot(kind='line')
mean_sgp_pop_growth = (sgp_data['population'].pct_change() * 100).mean()
last_15_sgp_pop_growth = (sgp_data['population'].pct_change()[-15:] * 100).mean()
plt.axhline(y=mean_sgp_pop_growth, color='r', linestyle='--')
plt.axhline(y=last_15_sgp_pop_growth, color='g', linestyle='--')
plt.text(54, mean_sgp_pop_growth-0.1, f'Mean Overall: {mean_sgp_pop_growth:.2f}%', color='r', fontsize=12)
plt.text(54, last_15_sgp_pop_growth-0.2, f'Last 15 Years: {last_15_sgp_pop_growth:.2f}%', color='g', fontsize=12)
plt.title('SGP Population Growth (Annual %)')
plt.show()
For the purposes of our model, we will assume that Singapore's Population Growth remains relatively unaffected by GDP. However, note that Therefore, we will model the population as having a steady growth of ~2% (as shown above) year on year.
Here is a detailed explanation of how our model will operate, along with the reasoning behind each step. First, we will set a value for Educational Expenditure as a percentage of GDP, which reflects the proportion of the national budget allocated to education.
Next, we will model the projected population growth for the next 50 years.
Our Multiple Linear Regression Model (MLM) will then come into play to predict GDP growth for each year, based on several critical variables:
- Previous year's Population
- Educational Expenditure as a percentage of GDP
- Educational Expenditure per Capita
- GDP per Capita
Once the model calculates the projected GDP Growth for a given year, we will use this growth rate to estimate the next year's GDP per Capita and Educational Expenditure per Capita.
This process will be repeated iteratively for each year over a 50-year period, allowing us to simulate how changes in education spending and population growth could influence Singapore's economic well-being over the long term.
The below cell describes the function for predicting Singapore's GDP trajectory based on an input variable of Educational Expenditure (percent GDP). A sample trajectory table is provided, with an input Educational Expenditure (percent GDP) of 4%
def predict_sgp(preset_edu_expense_per_gdp):
sgp_data_predicted = sgp_data.iloc[[-1]].copy()
sgp_data_predicted.loc[:, 'edu_expense_percent_gdp'] = preset_edu_expense_per_gdp
sgp_data_predicted.loc[:, 'edu_expense_per_capita'] = preset_edu_expense_per_gdp * sgp_data_predicted.gdp_per_capita / 100
for i in range(50):
variables = sgp_data_predicted.iloc[[-1]].copy()
new_gdp_growth = gdp_per_capita_mlm.predict(variables)
new_gdp = variables.iloc[0, 2] * variables.iloc[0, 0] * (1+new_gdp_growth/100)
new_pop = variables.iloc[0, 0] * (1+last_15_sgp_pop_growth/100)
new_gdp_per_capita = new_gdp / new_pop
new_edu_expense_per_capita = preset_edu_expense_per_gdp * new_gdp / (100 * new_pop)
sgp_data_predicted.loc[sgp_data_predicted.index[-1] + pd.DateOffset(years=1)] = [new_pop, preset_edu_expense_per_gdp, new_gdp_per_capita[0], new_edu_expense_per_capita[0]]
return sgp_data_predicted
predict_sgp(4).tail(5)
| population | edu_expense_percent_gdp | gdp_per_capita | edu_expense_per_capita | |
|---|---|---|---|---|
| 2069-01-01 | 1.106090e+07 | 4.0 | 89905.235088 | 3596.209404 |
| 2070-01-01 | 1.121233e+07 | 4.0 | 90004.570101 | 3600.182804 |
| 2071-01-01 | 1.136583e+07 | 4.0 | 90103.367836 | 3604.134713 |
| 2072-01-01 | 1.152143e+07 | 4.0 | 90201.629706 | 3608.065188 |
| 2073-01-01 | 1.167916e+07 | 4.0 | 90299.357141 | 3611.974286 |
The graph below depicts GDP per Capita (in 2073) vs. Edu Expenditure (percent GDP)
import matplotlib.pyplot as plt
# Define the preset values
preset_list = [0, 2, 4, 6, 8, 10, 12, 15, 20]
gdp_changes = []
# Collect the last value of gdp_per_capita for each preset
for preset in preset_list:
gdp_changes.append(predict_sgp(preset).gdp_per_capita.iloc[-1])
# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gdp_changes, marker='o', linestyle='-', color='b')
# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('GDP per Capita (in 2073)')
plt.title('GDP per Capita (in 2073) vs. Edu Expenditure (percent GDP)')
# Add a horizontal line showing the current GDP per Capita
plt.axhline(y=sgp_data.gdp_per_capita.iloc[-1], color='g', linestyle='--')
# Place a text annotation on the plot indicating the current GDP per Capita
plt.text(10, sgp_data.gdp_per_capita.iloc[-1] + 700, # Adjust position to avoid overlap
f'Current GDP per Capita: {sgp_data.gdp_per_capita.iloc[-1]:,.0f}',
color='g', fontsize=12)
# Optionally, add grid lines for better readability
plt.grid(True)
# Show the plot
plt.tight_layout()
plt.show()
Clearly, our model also predicts that GDP per Capita (50 years down the line) will rise when Edu Expenditure (percent GDP) rises. This confirms our initial findings.
Finally, lets plot a similar line chart of the predicted Gini Index in 50 years as we vary the Educational Expenditure (% GDP)
gini_indexes = []
preset_list = [0, 2, 4, 6, 8, 10]
for preset in preset_list:
gini_indexes.append(gini_lm.predict(predict_sgp(preset)[['edu_expense_per_capita']].iloc[[-1]]))
# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gini_indexes, marker='o', linestyle='-', color='b')
# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('Gini Index (in 2073)')
plt.title('Gini Index (in 2073) vs. Edu Expenditure (percent GDP)')
# Optionally, add grid lines for better readability
plt.grid(True)
# Show the plot
plt.tight_layout()
plt.show()
Clearly, the above graph also confirms our findings in Question 2, that when Eductional Expenditure (% GDP) increases, the predicted Gini Index decreases.
- As Educational Expenditure increases, predicted GDP per Capita in 2073 also increases.
- For every increase of 1% in Educational Expenditure (% GDP), the forecasted GDP per Capita in 2073 increases by $1382
- As Educational Expenditure increases, predicted Gini Index in 2073 decreases.
- For every increase of 1% in Educational Expenditure (% GDP), the forecasted Gini Index in 2073 decreases by 2
- At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted GDP per Capita in 2073 is $88507 USD
- If Singapore were to spend the median Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is $90299 USD
- At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted Gini Index in 2073 is 32.5
- If Singapore were to spend the median Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is 29.0
What is the correlation between government expenditure on education and GDP growth?
# Create a new figure object for the subplots
fig = plt.figure()
# Add a subplot (ax0) on the left side (1 row, 2 columns, 1st subplot)
ax0 = fig.add_subplot(1, 2, 1)
# Add a subplot (ax1) on the right side (1 row, 2 columns, 2nd subplot)
ax1 = fig.add_subplot(1, 2, 2)
# Create a box plot of the average global education expenditure (% of GDP) over the last 20 years
# Use the color blue and make the plot horizontal, then place it in the first subplot (ax0)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='box', color='blue', vert=False, figsize=(20, 8), ax=ax0)
# Set the title for the first subplot (ax0)
ax0.set_title('Box Plot of Global Education Expenditure (% of GDP)')
# Set the x-axis label for the first subplot (ax0)
ax0.set_xlabel('Education Expenditure (% of GDP)')
# Create a histogram of the average global education expenditure (% of GDP) over the last 20 years
# Place the histogram in the second subplot (ax1)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='hist', figsize=(20, 8), ax=ax1)
# Set the title for the second subplot (ax1)
ax1.set_title('Distribution of Global Education Expenditure (% of GDP)')
# Set the y-axis label for the second subplot (ax1)
ax1.set_ylabel('Number of Countries')
# Set the x-axis label for the second subplot (ax1)
ax1.set_xlabel('Education Expenditure (% of GDP)')
# Set a main title for the entire figure
fig.suptitle('Global Education Expenditure Analysis', fontsize=16)
# Display the figure with both subplots
plt.show()
From the graphs above, it is quite clear that the vast majority of countries spend between 3-5% of their GDP on Education. However, note that this also means that most countries GDP are 20-30 times their Education Expenditure. Above all, this shows how a countries GDP is really very closely tied to its Education Expenditure.
# Fill forward missing values in the global_edu_expense_percent_gdp dataframe to handle missing data
map_data = global_edu_expense_percent_gdp.ffill().reset_index()
# Reshape the data using melt to make it long-form for plotting
# 'index' becomes 'year', and country ISO codes will be the 'iso_code'
map_data = map_data.melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
# Map the ISO country codes to their corresponding country names using a dictionary
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
map_data = map_data[map_data.year.dt.year==2023]
# Convert the 'year' column to only include the year (dropping month/day if present)
map_data.year = map_data.year.dt.year
# Create an animated choropleth map using Plotly Express
px.choropleth(
map_data, # Input dataset
locations="iso_code", # The ISO codes to be used for location
color="value", # Column to represent in color scale (education expenditure as % of GDP)
hover_name="country_name", # The name to display when hovering over a country
color_continuous_scale=px.colors.diverging.RdBu, # Color scale for the choropleth
height=600, # Height of the figure
range_color=[3, 7], # Set the range for the color scale (3% to 7% for education expenditure)
title="Edu Expenditure (% GDP) in 2023"
)
From the graph , we can see that most developed countries, such as USA, Australia, Sweden, Finland, France all have high education expenditure (blue) whereas most underdeveloped countries, such as countries in Africa and Asia have low education expenditure (brown). Although there are definitely quite a few exceptions, this hints at a trend between how developed a country is and its Education Expenditure (% GDP)
mean_edu_expense = pd.DataFrame(global_edu_expense_percent_gdp.iloc[-10:, :].mean(axis=0), columns=['Mean Education Expenditure (% of GDP)'])
mean_gdp_per_capita = pd.DataFrame(global_gdp_per_capita.iloc[-10:, :].mean(axis=0), columns=['Mean GDP Per Capita'])
mean_pop = pd.DataFrame(global_pop.iloc[-10:, :].mean(axis=0), columns=['Mean Population'])
merged_df = pd.merge(mean_edu_expense, mean_gdp_per_capita, left_index=True, right_index=True).merge(mean_pop, right_index=True, left_index=True)
def pearson_coef(min_pop, min_gdp_per_capita):
return merged_df[(merged_df['Mean Population'] >= min_pop) & (merged_df['Mean GDP Per Capita'] >= min_gdp_per_capita)].corr().iloc[0, 1]
pop_ranges = range(10000000, 1000000, -1000000)
gdp_ranges = range(5000, 10001, 1000)
correlation_matrix = pd.DataFrame(index=pop_ranges, columns=gdp_ranges)
for min_pop in pop_ranges:
for min_gdp in gdp_ranges:
correlation_matrix.loc[min_pop, min_gdp] = pearson_coef(min_pop, min_gdp)
correlation_matrix = correlation_matrix.apply(pd.to_numeric, errors='coerce')
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True,
xticklabels=[int(gdp / 1000) for gdp in gdp_ranges], yticklabels=[int(pop / 1e6) for pop in pop_ranges])
plt.title('Correlation between Education Expenditure (% GDP) and GDP per Capita')
plt.xlabel('Min GDP per Capita Threshold (Thousand USD)')
plt.ylabel('Min Population Threshold (Millions)')
plt.text(-1, 5, 'Bigger Countries', rotation='vertical', color='g', fontsize=20)
plt.annotate('', xy=(-0.1, 0.9), xycoords='axes fraction', xytext=(-0.1, 0.1),
arrowprops=dict(facecolor='g', arrowstyle='->', edgecolor='g', lw=2), fontsize=12, ha='center', va='center')
plt.text(2, 11, 'More Developed Countries', rotation='horizontal', color='g', fontsize=20)
plt.annotate('', xy=(0.9, -0.15), xycoords='axes fraction', xytext=(0.1, -0.15),
arrowprops=dict(facecolor='g', arrowstyle='->', edgecolor='g', lw=2), fontsize=12, ha='center', va='center')
plt.show()
From the graph, we can see that as we increase the Min Population Threshold and the Min GDP per capita Threshold, the correlation between Education Expenditure as a percent of GDP and GDP per Capita gets stronger and stronger. In other words, this confirms that as countries get Bigger and More Developed, their GDP Growth is ever more reliant on the Education Expenditure
To what extent does government expenditure on education reduce income inequality, as measured by the Gini coefficient?
mean_gini = gini.iloc[-10:, :].mean()
mean_top_10 = top_10.iloc[-10:, :].mean()
mean_top_20 = top_20.iloc[-10:, :].mean()
mean_bottom_20 = bottom_20.iloc[-10:, :].mean()
mean_bottom_10 = bottom_10.iloc[-10:, :].mean()
merged_mean = pd.DataFrame(data={"Gini Index": mean_gini, "Top 10%": mean_top_10, "Top 20%": mean_top_20, "Bottom 10%": mean_bottom_10, "Bottom 20%": mean_bottom_20})
fig = sns.violinplot(merged_mean)
plt.ylabel('Number of Countries')
plt.title("Boxplots of Income Inequality Indicators")
plt.show()
The above graph shows how there is indeed a very pressing problem of Income Inequality. The difference between the shares of the Top 10% and the Bottom 10% are quite vast. Should Education Expenditure prove to be a viable method of decreasing Income Inequality, the graph above makes it clear that investing in Education becomes all the more important.
plt.figure(figsize=(9, 7))
inv_color_dict = edu_expense_per_capita.iloc[-10:, :].mean().dropna().sort_values(ascending=False).reset_index().drop(0, axis=1).to_dict()['index']
keys = np.array(list(inv_color_dict.keys()))
values = np.array(list(inv_color_dict.values()))
color_dict = dict(zip(values, keys))
# Get a colormap
cmap = plt.get_cmap('coolwarm')
norm = plt.Normalize(0, len(color_dict))
# Plot each country's Gini index as a scatter plot, only if the country exists in color_dict
for country in gini.columns:
if country in color_dict: # Check if the country is in the color_dict
color_value = norm(color_dict[country]) # Normalize the country index
sns.scatterplot(x=gini.index.year, y=gini[country], color=cmap(color_value), label=country)
# Add a legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', ncol=6)
plt.ylabel("Gini Index")
plt.xlabel("Time")
plt.text(1965, 45, "Blue to Red", rotation='vertical', color='g', fontsize=20)
plt.arrow(1970, 40, 0, 25, color='g', width=0.7)
plt.show()
The dots in the graph above have been coloured such that countries with Low Educational Expenditure have been coloured Red, while the countries with High Educational Expenditure have been colored Blue. From the graph, we can see a clear trend of Red/Grey to Blue as we move downward, indicating how countries with Lower Gini Index tend to have Higher Educational Expenditure
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
circle1 = patches.Ellipse((400, 55), 1000, 23, color='g', fill=False)
# First subplot - Gini Index vs Educational Expenditure
sns.regplot(data=q2_df, x='edu_expense_per_capita', y='gini_index', marker='.', line_kws={"color": "#FF6666"}, ax=axes[0])
axes[0].set_ylabel("Gini Index")
axes[0].set_xlabel("Edu Expenditure per Capita")
axes[0].set_title("Gini Index vs Educational Expenditure")
axes[0].add_patch(circle1)
axes[0].text(1100, 55, "Outliers ------------------------------>", color='g', fontsize=20)
# Second subplot - Distribution of GDP per Capita among Outliers
outliers = q2_df[(q2_df.gini_index > q2_df.gini_index.quantile(0.75)) & (q2_df.edu_expense_per_capita < 800)].groupby(['iso_code'])[['gini_index', 'edu_expense_per_capita']].mean()
outliers['GDP_per_capita'] = outliers.index.map(global_gdp_per_capita.mean(axis=0).to_dict())
outliers['GDP_per_capita'].plot(kind='hist', ax=axes[1])
axes[1].set_title('Distribution of GDP per Capita among Outliers')
axes[1].set_xlabel('GDP per Capita (USD)')
axes[1].set_ylabel('Number of Countries')
axes[1].text(3000, 15, "Very Low GDP Per Capita \n => Underdeveloped Countries", fontsize=15, color='g')
# Show the combined plot
plt.tight_layout()
plt.show()
The scatterplot above nicely displays how an increasing Education Expenditure certainly brings about a decreasing Gini Index. When we investigate the outliers, we discover that all of them are underdeveloped countries with very low GDP per Capita and a very low Education Expenditure. This further supports our findings that increasing Education Expenditure brings about a decreasing Gini Index.
What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economical well-being of its population?
We will forecast two key variables: GDP Growth (Annual %) and Gini Index.
The (Multiple Linear Regression Model) MLM for GDP Growth (Annual %) will incorporate Population, Educational Expenditure as a percent of GDP, Educational Expenditure per Capita, GDP per Capita as predictors. By providing inputs for Educational Expenditure as a percent of GDP, we will explore how it influences Singapore’s future GDP. Furthermore, based on our findings from Question 1, we will exclude countries with very low Population or GDP per Capita from the training data to enhance the robustness of the model.
The (Linear Regression Model) LM for Gini Index will focus on predicting income inequality, using Educational Expenditure per Capita as a key predictor, in line with the insights derived from our analysis in Question 2.
Here is a detailed explanation of how our model will operate, along with the reasoning behind each step. First, we will set a value for Educational Expenditure as a percentage of GDP, which reflects the proportion of the national budget allocated to education.
Next, we will model the projected population growth for the next 50 years.
Our Multiple Linear Regression Model (MLM) will then come into play to predict GDP growth for each year, based on several critical variables:
- Previous year's Population
- Educational Expenditure as a percentage of GDP
- Educational Expenditure per Capita
- GDP per Capita
Once the model calculates the projected GDP Growth for a given year, we will use this growth rate to estimate the next year's GDP per Capita and Educational Expenditure per Capita.
This process will be repeated iteratively for each year over a 50-year period, allowing us to simulate how changes in education spending and population growth could influence Singapore's economic well-being over the long term.
import matplotlib.pyplot as plt
# Define the preset values
preset_list = [0, 2, 4, 6, 8, 10, 12, 15, 20]
gdp_changes = []
# Collect the last value of gdp_per_capita for each preset
for preset in preset_list:
gdp_changes.append(predict_sgp(preset).gdp_per_capita.iloc[-1])
# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gdp_changes, marker='o', linestyle='-', color='b')
# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('GDP per Capita (in 2073)')
plt.title('GDP per Capita (in 2073) vs. Edu Expenditure (percent GDP)')
# Add a horizontal line showing the current GDP per Capita
plt.axhline(y=sgp_data.gdp_per_capita.iloc[-1], color='g', linestyle='--')
# Place a text annotation on the plot indicating the current GDP per Capita
plt.text(10, sgp_data.gdp_per_capita.iloc[-1] + 700, # Adjust position to avoid overlap
f'Current GDP per Capita: {sgp_data.gdp_per_capita.iloc[-1]:,.0f}',
color='g', fontsize=12)
# Optionally, add grid lines for better readability
plt.grid(True)
# Show the plot
plt.tight_layout()
plt.show()
Clearly, our model predicts that GDP per Capita (50 years down the line) will rise when Edu Expenditure (percent GDP) rises. More concretely, for every increase of 1% in Educational Expenditure (% GDP), the forecasted GDP per Capita in 2073 increases by $1382. This shows how (at least according to this model), investing in Education has a big impact on Singapore's future GDP
gini_indexes = []
preset_list = [0, 2, 4, 6, 8, 10]
for preset in preset_list:
gini_indexes.append(gini_lm.predict(predict_sgp(preset)[['edu_expense_per_capita']].iloc[[-1]]))
# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gini_indexes, marker='o', linestyle='-', color='b')
# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('Gini Index (in 2073)')
plt.title('Gini Index (in 2073) vs. Edu Expenditure (percent GDP)')
# Optionally, add grid lines for better readability
plt.grid(True)
# Show the plot
plt.tight_layout()
plt.show()
Our model also predicts that Gini Index (50 years down the line) will decrease when Edu Expenditure (percent GDP) rises. More concretely, For every increase of 1% in Educational Expenditure (% GDP), the forecasted Gini Index in 2073 decreases by 2. This shows how (at least according to this model), investing in Education has a big impact on Singapore's future Gini Index
In this analysis, we explored the impact of government expenditure on education on key economic indicators such as GDP growth and income inequality. The findings indicate that government investment in education plays a pivotal role in driving economic growth and reducing inequality.
First, the strong positive correlation between a country’s expenditure on education and its GDP underscores the importance of prioritizing education as a means of enhancing national economic performance. As countries grow and develop, this relationship becomes even more pronounced, suggesting that education investment should scale alongside economic progress.
Second, the analysis of income inequality, as measured by the Gini Index, reveals that higher educational expenditure per capita is associated with a reduction in inequality. Countries with the highest levels of inequality tend to invest significantly less in education, further emphasizing the role of education in promoting a more equitable society.
Finally, the predictive trends for Singapore demonstrate that increasing educational expenditure not only boosts GDP per capita but also leads to a more equitable distribution of wealth in the future. Projections for 2073 show that even modest increases in education spending could result in significant economic gains and a reduction in income inequality. These findings highlight the potential long-term benefits of education investments and the critical role they play in shaping a nation’s economic future.
While the analysis and predictions show a high degree of accuracy, it is important to recognize that long-term success will depend on consistent and targeted educational spending. Future research could explore the integration of other social factors and investigate how the quality of education impacts these economic indicators. Expanding the scope to include more countries and exploring the psychological and social benefits of education could further enhance the understanding of its comprehensive impact on well-being.
In conclusion, investing in education is not just about fostering knowledge—it is a powerful economic tool that drives growth, reduces inequality, and builds a stronger, more equitable future. Governments should consider increasing their expenditure on education to maximize these benefits.
Areas for Improvement¶
Data Quality and Coverage: To make our findings more accurate, we should aim to gather more comprehensive data, especially for countries where educational and economic data are either incomplete or not widely available. By expanding the dataset to cover more countries and extending the time periods, we can strengthen the conclusions drawn from this analysis.
Control for Other Variables: While education expenditure is clearly important, other factors—like political stability, health spending, and labor market policies—also play a role in affecting GDP and inequality. It would make sense to conduct a multivariate analysis that accounts for these variables to provide a clearer understanding of how education spending alone impacts these economic indicators.
Country-Specific Analysis: Although global and regional trends offer valuable insights, drilling down to a more granular, country-specific level might reveal even more actionable insights for individual governments. Exploring how education quality, curriculum relevance, and infrastructure differ from country to country could help refine policy recommendations that are more targeted.
Disaggregate Gini Index Data: Currently, the Gini Index is treated as an aggregate in the analysis. To get a better grasp of how education impacts inequality, further research could break down the Gini Index by different factors like age, gender, and regions within countries. This could lead to more targeted interventions, especially in places where inequality is deeply divided along these lines.
Further Research¶
Investigate Long-Term Impact of Education Expenditure: It would be beneficial to explore how education spending impacts economies over a longer time frame. Longitudinal studies that track this relationship across decades could give us deeper insights into the delayed effects of education investment on both GDP and inequality. Understanding these time lags is crucial for refining long-term policy decisions.
Examine the Role of Educational Quality: Beyond just spending, it’s important to consider the quality of education itself. Metrics like student performance, literacy rates, and international testing scores could shed light on how improved educational outcomes at any given level of expenditure affect GDP and inequality.
Expand the Scope to Include Technological and Vocational Training: Another area worth exploring is the impact of vocational and technical education spending. As economies evolve and the demand for skilled labor in technical fields increases, it’s important to understand how this type of education affects overall economic outcomes.
Cross-Compare with Other Forms of Public Expenditure: It could be insightful to compare education spending with other areas of public expenditure, such as healthcare or infrastructure, to determine which type of investment yields the best returns in terms of economic growth and reducing inequality.
Forecasting Techniques: To enhance predictions of future trends, exploring machine learning models like time-series analysis or econometric modeling could provide more robust forecasts for how educational expenditure might affect GDP and the Gini Index moving forward.
- https://www.influxdata.com/time-series-forecasting-methods/#:~:text=Time%20series%20forecasting%20is%20a,Astronomy
- https://www.tableau.com/learn/articles/time-series-forecasting
- https://otexts.com/fpp2/data-methods.html
- https://en.wikipedia.org/wiki/Gini_coefficient
- https://data.worldbank.org/indicator/SI.POV.GINI
- https://databank.worldbank.org/metadataglossary/gender-statistics/series/SI.POV.GINI
- Give guidelines on how to build a predictive model in python for predicting GDP primarily based on governmental expenditure on education. I have the following datasets.
- Could you build an AI model instead? Would you recommend doing so?
- Implement the non-AI predictive models, using the following data
- How do we check for multicollinearity?
- How would you use time-series forecasting methods instead?
- Give me a list of time-series forecasting methods, and their pros and cons
- Give me a zero to hero course on prophet
- Can prophet handle predicting based on.multiple variables?
- what was 2024 1q gdp sg in sgd?
- you did not import pd
- Give me a diverging chloropleth colour scheme
- Think of at least 25 possible visualisations based on the following data. Note that all the data are across countries
- Give me code for a treemap
- Please web-scrape https://www.iban.com/country-codes
- Is it reasonable to say that countries with lower that 5000 usd GDP per capita are underdeveloped or developing?
- Give me visualisations for the below data, ranging across countries and years
- I have a df with countries as the columns and years as the index. I also have another df of GDP of a country. I want to plot a lineplot of the entries of the first df, across time. However, I also want to color the countries in descending GDP, from red to blue. How?
- convert these 2 to subplots in the same figure
- make the below plots 2 static subplots, one for gini and one for edu_expense
- I want the color bars to be shown sepatately
- is gini coeff higher in developed or developing countries
- how to give chloropleth graph a title?
- I want you to create 2 subplots, the right one displaying the figure below and the left one having q3_df.gini_index.describe()
- could you use .text() for the right one?
- how to write 3rd markdown
- no i meant with the rd on top
- Answer the following question with the attached data. What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economical well-being of its population?
- perhaps you are using the old file path structure?
- Answer the following question assuming you have the below data. What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economical well-being of its population?
- Using this dataset, create an mlm model to predict the gdp of singapore in the times to come
- Rewrite the below
- Rewrite the below.
- Rewrite the below, elaborating and justifying
- Add some text saying "Mean Overall Population Growth" and "Mean Population Growth in the last decade"
- what are all the libraries required for these. annwer in the form ! pip3 install ..
- can you give them in separate lines? and use ! pip install..
- Give an overview of what these indicators mean We will be using the following datasets for this question
- make far more brief
- For the following prompts, I will give you some code. DO NOT CHANGE THE CODE ITSELF, but add appropriate comments everywhere.